<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>INT8 on Tech Snippets - 嵌入式技术笔记</title>
    <link>https://tech-snippets.xyz/tags/int8/</link>
    <description>Recent content in INT8 on Tech Snippets - 嵌入式技术笔记</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Fri, 05 Jun 2026 19:00:00 +0800</lastBuildDate>
    <atom:link href="https://tech-snippets.xyz/tags/int8/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Embedded NPU Architecture and Operator Optimization in Practice: From Memory Bandwidth to INT8 Deployment</title>
      <link>https://tech-snippets.xyz/posts/embedded-npu-architecture-operator-optimization-guide/</link>
      <pubDate>Fri, 05 Jun 2026 19:00:00 +0800</pubDate>
      <guid>https://tech-snippets.xyz/posts/embedded-npu-architecture-operator-optimization-guide/</guid>
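      <description>Preface: Why does the same model perform so differently on different NPUs? When engineers working on embedded AI deployment get their first NPU board, many share the same misconception: if the datasheet says 1 TOPS, 6 TOPS, or 10 TOPS, the model should speed up linearly with that number. Real projects often behave differently. The same YOLO, MobileNet, or keyword-spotting model runs smoothly on chip A but stalls on a few operators on chip B. With the same INT8 quantization, some models lose almost no accuracy while others produce obvious false detections. With the same official conversion tool, some networks convert in one pass while others need repeated ONNX graph edits, operator replacements, and subgraph splits.
None of this is mysterious. It comes down to the tight coupling between the NPU's compute array, on-chip SRAM, DMA, data layout, compiler, and runtime. When CPU code is slow, we usually start with the hot functions; when a GPU program is slow, we look at kernel occupancy, memory access, and thread blocks. A slow NPU deployment needs a similar framework: first determine whether the bottleneck is compute, bandwidth, operator support, quantization error, or scheduling overhead between the CPU and the NPU.
This article breaks down the typical architecture of an embedded NPU from an engineering perspective and follows a realistic deployment flow: model export, graph optimization, quantization calibration, operator mapping, memory planning, runtime pipelining, and performance troubleshooting. It is not tied to any single vendor, but it covers problems common to RK, Amlogic, Kendryte, Cambricon edge modules, and many MCU-class NPUs. After reading it, you should be able to tell why a model is not saturating the NPU and know where to start optimizing.
1. What TOPS Actually Means TOPS means tera (10^12) operations per second and is usually quoted for INT8 multiply-accumulate throughput. A 2 TOPS NPU can, in theory, complete 2 trillion 8-bit integer operations per second. The catch is that this figure is usually a peak under ideal conditions: inputs and outputs are already in a suitable data layout, the operator maps fully onto the matrix-multiply array, the on-chip buffer hit rate is high enough, DMA transfers do not fall behind, and the scheduler is not switching tasks frequently.
In real models, the operators that use an NPU efficiently are usually convolution, depthwise convolution, fully connected layers, matrix multiplication, some pooling, and activation functions. Many unremarkable-looking operations, such as Reshape, Transpose, Slice, Gather, Resize, and NonMaxSuppression, may fall back to the CPU if the NPU does not support them natively. A CPU fallback costs more than the compute time itself: it can also trigger cache synchronization, data format conversion, and memory copies. A few such "breakpoints" in a model are enough to make end-to-end latency noticeably worse.
When evaluating an NPU, the following metrics are more valuable than TOPS:</description>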
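      <content:encoded><![CDATA[<h2 id="前言为什么同一个模型在不同-npu-上差距很大">Preface: Why Does the Same Model Perform So Differently on Different NPUs?</h2>
<p>When engineers working on embedded AI deployment get their first NPU board, many share the same misconception: if the datasheet says 1 TOPS, 6 TOPS, or 10 TOPS, the model should speed up linearly with that number. Real projects often behave differently. The same YOLO, MobileNet, or keyword-spotting model runs smoothly on chip A but stalls on a few operators on chip B. With the same INT8 quantization, some models lose almost no accuracy while others produce obvious false detections. With the same official conversion tool, some networks convert in one pass while others need repeated ONNX graph edits, operator replacements, and subgraph splits.</p>
<p>None of this is mysterious. It comes down to the tight coupling between the NPU's compute array, on-chip SRAM, DMA, data layout, compiler, and runtime. When CPU code is slow, we usually start with the hot functions; when a GPU program is slow, we look at kernel occupancy, memory access, and thread blocks. A slow NPU deployment needs a similar framework: first determine whether the bottleneck is compute, bandwidth, operator support, quantization error, or scheduling overhead between the CPU and the NPU.</p>
<p>This article breaks down the typical architecture of an embedded NPU from an engineering perspective and follows a realistic deployment flow: model export, graph optimization, quantization calibration, operator mapping, memory planning, runtime pipelining, and performance troubleshooting. It is not tied to any single vendor, but it covers problems common to RK, Amlogic, Kendryte, Cambricon edge modules, and many MCU-class NPUs. After reading it, you should be able to tell why a model is not saturating the NPU and know where to start optimizing.</p>
<p><img alt="Embedded NPU operator execution pipeline" loading="lazy" src="/images/embedded-npu-operator-pipeline.svg"></p>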
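<h2 id="一先把-tops-的含义说清楚">1. What TOPS Actually Means</h2>
<p>TOPS means tera (10^12) operations per second and is usually quoted for INT8 multiply-accumulate throughput, with one multiply-accumulate typically counted as two operations. A 2 TOPS NPU can, in theory, complete 2 trillion 8-bit integer operations per second. The catch is that this figure is usually a peak under ideal conditions: inputs and outputs are already in a suitable data layout, the operator maps fully onto the matrix-multiply array, the on-chip buffer hit rate is high enough, DMA transfers do not fall behind, and the scheduler is not switching tasks frequently.</p>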
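<p>To make the peak figure concrete, here is a rough back-of-the-envelope sketch. It assumes the common convention of counting one multiply-accumulate as two operations. The MAC count, clock frequency, model size, and latency are made-up illustrative values, not data for any real chip.</p>

```python
def peak_tops(num_macs: int, freq_hz: float) -> float:
    """Theoretical peak in TOPS, counting each MAC as 2 ops (multiply + add)."""
    return 2 * num_macs * freq_hz / 1e12


def utilization(model_macs: float, latency_s: float, peak: float) -> float:
    """Fraction of the peak that a model actually achieves."""
    achieved_tops = 2 * model_macs / latency_s / 1e12
    return achieved_tops / peak


# Hypothetical NPU: 1024 INT8 MAC units running at 1 GHz
peak = peak_tops(1024, 1e9)
# Hypothetical model: 4 GMACs per inference, measured at 20 ms
util = utilization(4e9, 20e-3, peak)
print(f"peak = {peak:.3f} TOPS, utilization = {util:.1%}")
```

<p>A utilization far below 100% is normal. The point of the calculation is to see how far a measured latency sits from the datasheet number before looking for the cause.</p>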
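<p>In real models, the operators that use an NPU efficiently are usually convolution, depthwise convolution, fully connected layers, matrix multiplication, some pooling, and activation functions. Many unremarkable-looking operations, such as <code>Reshape</code>, <code>Transpose</code>, <code>Slice</code>, <code>Gather</code>, <code>Resize</code>, and <code>NonMaxSuppression</code>, may fall back to the CPU if the NPU does not support them natively. A CPU fallback costs more than the compute time itself: it can also trigger cache synchronization, data format conversion, and memory copies. A few such "breakpoints" in a model are enough to make end-to-end latency noticeably worse.</p>
<p>When evaluating an NPU, the following metrics are more valuable than TOPS:</p>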
<ol>
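<li><strong>End-to-end latency</strong>: total time from image capture or audio frame input to the final result.</li>
<li><strong>NPU subgraph coverage</strong>: how many of the model's operators actually execute on the NPU.</li>
<li><strong>DDR bandwidth usage</strong>: whether inputs, outputs, and intermediate feature maps travel to and from external memory frequently.</li>
<li><strong>Batch and resolution sensitivity</strong>: embedded workloads mostly run at batch=1, so many server-side optimizations do not apply.</li>
<li><strong>Post-quantization accuracy</strong>: whether INT8 mAP, Top-1, or false wake-up rate meets the product requirement.</li>
<li><strong>Power and temperature rise</strong>: whether the clock throttles after 30 minutes of sustained inference.</li>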
</ol>
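<p>If you only look at peak TOPS, it is easy to blame "a weak chip". Often the real problem is that the model graph does not suit the NPU, or that preprocessing and postprocessing slow down the whole pipeline.</p>
<h2 id="二嵌入式-npu-的典型硬件结构">2. Typical Hardware Structure of an Embedded NPU</h2>
<p>Implementation details differ between vendors, but an embedded NPU can usually be abstracted into a few blocks: a MAC array, on-chip SRAM, a DMA controller, an instruction scheduler, a data rearrangement unit, and an external DDR interface.</p>
<p>The MAC array performs the core multiply-accumulate work. At compile time, a convolution is converted into matrix multiplication or sliding-window multiply-accumulate tasks, then cut into tiles that are fed to the array. On-chip SRAM holds weight blocks, input blocks, and output blocks so that not every multiply-accumulate has to touch DDR. DMA moves data between DDR and SRAM. The data rearrangement unit handles conversions between layouts such as NCHW, NHWC, and NC1HWC2. The instruction scheduler issues the compiler-generated command stream to the hardware according to its dependencies.</p>
<p>A simplified convolution runs as follows:</p>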
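<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Input feature map in DDR -&gt; DMA into on-chip SRAM
</span></span><span class="line"><span class="cl">Weight block in DDR      -&gt; DMA into on-chip SRAM
</span></span><span class="line"><span class="cl">MAC array runs the tile convolution
</span></span><span class="line"><span class="cl">Partial outputs written back to SRAM
</span></span><span class="line"><span class="cl">Activation / quantization / accumulation as needed
</span></span><span class="line"><span class="cl">Final output written back to DDR via DMA
</span></span></code></pre></div><p>The key word here is "tile". On-chip SRAM is limited and cannot hold a complete high-resolution feature map plus all the weights at once. The compiler has to split a large operator into many small blocks based on SRAM size, array shape, data type, and operator parameters. If tiles are too small, DMA and scheduling overhead grows; if they are too large, they either do not fit in SRAM or reduce reuse. Many performance differences between NPUs look like TOPS differences on the surface but really come down to how well the tiling strategy, data layout, and memory hierarchy are designed.</p>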
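<p>The SRAM constraint on tile size can be sketched with a few lines of arithmetic. This is a deliberately simplified model, not how any particular compiler works: it assumes INT8 inputs and weights, 32-bit accumulators, all weights of the layer resident at once, and no double buffering. The layer shape and the 512 KiB budget are hypothetical.</p>

```python
def tile_bytes(th, tw, cin, cout, k, stride=1, acc_bytes=4):
    """SRAM bytes needed for one INT8 conv tile producing th x tw x cout outputs."""
    in_h = (th - 1) * stride + k          # input rows needed, including the halo
    in_w = (tw - 1) * stride + k
    inputs = in_h * in_w * cin            # INT8 activations
    weights = k * k * cin * cout          # INT8 weights
    outputs = th * tw * cout * acc_bytes  # 32-bit accumulators
    return inputs + weights + outputs


def largest_square_tile(sram_bytes, out_size, cin, cout, k):
    """Largest square output tile that fits in the budget, or None if none fits."""
    for t in range(out_size, 0, -1):
        if sram_bytes >= tile_bytes(t, t, cin, cout, k):
            return t
    return None


# Hypothetical layer: 3x3 conv, 64 -> 64 channels, 80x80 output, 512 KiB of SRAM
tile = largest_square_tile(512 * 1024, 80, 64, 64, 3)
print("largest square tile edge:", tile)
```

<p>Real compilers also tile along the channel dimensions and reserve space for double buffering, so the usable tile is usually smaller than this estimate.</p>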
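<h2 id="三从模型图看-npu-友好程度">3. Judging NPU Friendliness from the Model Graph</h2>
<p>Before deployment, inspect the model graph with Netron, ONNX GraphSurgeon, or the vendor tools. An NPU-friendly model usually has these traits: the backbone consists of common operators such as Conv, BN, ReLU/SiLU, Pooling, and MatMul; the branch structure is not too complex; there are few dynamic shapes; postprocessing can be moved to the CPU at a point where the data volume is already small; the input resolution is fixed; and there are not many <code>Transpose</code> or <code>Gather</code> nodes.</p>
<p>Take object detection as an example. The backbone and neck usually map well onto the NPU, but decode and NMS are frequent trouble spots. Many models, once exported to ONNX, keep grid generation, coordinate transforms, threshold filtering, and NMS inside the graph. That is convenient with ONNX Runtime on a PC, but on an embedded NPU it can cause extensive CPU fallback. A safer approach is to have the NPU output the feature maps at three scales and implement postprocessing separately in C/C++.</p>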
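<p>As an illustration of the logic that moves out of the graph, here is a minimal greedy NMS sketch. It is written in Python for readability, whereas a product would typically port it to C/C++. The box format <code>(x1, y1, x2, y2)</code> and the two thresholds are assumptions chosen for the example.</p>

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thr=0.25, iou_thr=0.45):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou_thr >= iou(boxes[i], boxes[j]) for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

<p>Because the score threshold is applied first, the CPU only touches a small number of candidate boxes, which is why this step is cheap outside the NPU graph.</p>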
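<p>Activation functions are another common issue. Classic ReLU is NPU-friendly, but support for Swish, GELU, and HardSwish varies by chip. Some NPUs can fuse <code>Conv + BN + ReLU</code> but cannot fuse <code>Conv + BN + SiLU</code> well. If the model allows it, account for deployment constraints during training instead of forcing the adaptation after training is finished.</p>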
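<h2 id="四模型转换前的图优化少一个-transpose-就少一次搬运">4. Graph Optimization Before Conversion: One Less Transpose Is One Less Data Move</h2>
<p>NPU compilers usually perform constant folding, operator fusion, dead-node elimination, and layout propagation, but do not expect them to solve everything. In practice it is more reliable to clean up the graph yourself after exporting to ONNX. Graphs exported from PyTorch in particular often contain redundant <code>Unsqueeze</code>, <code>Concat</code>, <code>Slice</code>, and <code>Transpose</code> nodes left over from how the framework expresses operations. These nodes are barely noticeable on a desktop but can become performance breakpoints on an NPU.</p>
<p>The following simple ONNX inspection script counts operator types and lists suspect nodes:</p>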
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">onnx</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">onnx</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">&#34;model.onnx&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">ops</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span><span class="n">node</span><span class="o">.</span><span class="n">op_type</span> <span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">model</span><span class="o">.</span><span class="n">graph</span><span class="o">.</span><span class="n">node</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">op</span><span class="p">,</span> <span class="n">count</span> <span class="ow">in</span> <span class="n">ops</span><span class="o">.</span><span class="n">most_common</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">op</span><span class="si">:</span><span class="s2">20s</span><span class="si">}</span><span class="s2"> </span><span class="si">{</span><span class="n">count</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">suspect</span> <span class="o">=</span> <span class="p">{</span><span class="s2">&#34;Transpose&#34;</span><span class="p">,</span> <span class="s2">&#34;Gather&#34;</span><span class="p">,</span> <span class="s2">&#34;Slice&#34;</span><span class="p">,</span> <span class="s2">&#34;Resize&#34;</span><span class="p">,</span> <span class="s2">&#34;NonMaxSuppression&#34;</span><span class="p">,</span> <span class="s2">&#34;Shape&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="s2">&#34;</span><span class="se">\n</span><span class="s2">Suspect ops:&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">model</span><span class="o">.</span><span class="n">graph</span><span class="o">.</span><span class="n">node</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">node</span><span class="o">.</span><span class="n">op_type</span> <span class="ow">in</span> <span class="n">suspect</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="n">node</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">node</span><span class="o">.</span><span class="n">op_type</span><span class="p">,</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">node</span><span class="o">.</span><span class="n">input</span><span class="p">],</span> <span class="s2">&#34;-&gt;&#34;</span><span class="p">,</span> <span class="p">[</span><span class="n">o</span> <span class="k">for</span> <span class="n">o</span> <span class="ow">in</span> <span class="n">node</span><span class="o">.</span><span class="n">output</span><span class="p">])</span>
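</span></span></code></pre></div><p>If there are many <code>Transpose</code> nodes, check whether they only switch back and forth between NCHW and NHWC. Some conversion tools require a fixed input layout, and a wrong export setting makes them insert extra rearrangements at the start and end of the graph. A rearrangement does no complex computation, yet it reads and writes the entire feature map. For a 640×640 detection model, an intermediate feature map can be several MB, so a handful of rearrangements is enough to consume a large share of bandwidth.</p>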
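<p>Two quick checks follow from this: how many bytes one rearrangement moves, and whether two adjacent <code>Transpose</code> nodes simply undo each other. The sketch below uses plain permutation lists with the same meaning as the ONNX <code>perm</code> attribute (output axis <code>i</code> takes input axis <code>perm[i]</code>). The 80×80×256 INT8 feature map is an illustrative size, not a measurement.</p>

```python
def layout_change_traffic(h, w, c, bytes_per_elem=1):
    """Bytes moved by one layout change: read the whole tensor, then write it back."""
    return 2 * h * w * c * bytes_per_elem


def perms_cancel(p1, p2):
    """True if Transpose(perm=p1) followed by Transpose(perm=p2) is a no-op."""
    composed = [p1[i] for i in p2]
    return composed == list(range(len(p1)))


NCHW_TO_NHWC = [0, 2, 3, 1]
NHWC_TO_NCHW = [0, 3, 1, 2]

# Illustrative INT8 feature map: 80 x 80 spatial, 256 channels
traffic = layout_change_traffic(80, 80, 256)
print(f"{traffic / 2**20:.2f} MiB moved per Transpose")
print("back-to-back pair cancels:", perms_cancel(NCHW_TO_NHWC, NHWC_TO_NCHW))
```

<p>A pair that cancels can be removed from the graph outright, provided nothing else consumes the intermediate tensor.</p>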
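<p>Graph optimization follows three basic principles:</p>
<ul>
<li><strong>Fuse whatever can be fused</strong>: Conv + BN + Activation should be fused before conversion or by the compiler wherever possible.</li>
<li><strong>Make static whatever can be static</strong>: fix the input size and batch, and keep dynamic shapes out of the NPU subgraph.</li>
<li><strong>Move out whatever can be moved out</strong>: NMS, string handling, complex indexing, and other logic unsuited to the NPU belong in CPU postprocessing.</li>
</ul>
<p>Take BatchNorm fusion as an example. At inference time, BN can be folded into the convolution weights and bias. The formula is simple:</p>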
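<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># W, b: original conv weight (out_ch, in_ch, kh, kw) and bias (out_ch,); use zeros if the conv has no bias</span>
</span></span><span class="line"><span class="cl"><span class="c1"># mean, var, gamma, beta: BN parameters, each of shape (out_ch,)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># eps: BN epsilon</span>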
</span></span><span class="line"><span class="cl"><span class="n">scale</span> <span class="o">=</span> <span class="n">gamma</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">var</span> <span class="o">+</span> <span class="n">eps</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">W_fused</span> <span class="o">=</span> <span class="n">W</span> <span class="o">*</span> <span class="n">scale</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">b_fused</span> <span class="o">=</span> <span class="n">beta</span> <span class="o">+</span> <span class="p">(</span><span class="n">b</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">*</span> <span class="n">scale</span>
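</span></span></code></pre></div><p>After fusion, the runtime has one fewer BN operator and one fewer round trip of the intermediate feature map through memory. For large models this seemingly basic optimization matters a lot, because embedded inference is often limited by the cost of memory access rather than by compute.</p>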
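<p>It is worth confirming numerically that the fused layer matches the original before converting. The sketch below does this for a 1×1 convolution at a single pixel, using only the standard library so it runs anywhere. All weights and BN statistics are arbitrary test values.</p>

```python
import math


def conv1x1(x, W, b):
    """1x1 convolution at one pixel: x is [cin], W is [cout][cin], b is [cout]."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bo for row, bo in zip(W, b)]


def batchnorm(y, mean, var, gamma, beta, eps):
    return [g * (v - m) / math.sqrt(s + eps) + be
            for v, m, s, g, be in zip(y, mean, var, gamma, beta)]


def fuse(W, b, mean, var, gamma, beta, eps):
    """Fold BN into the conv weights and bias, one output channel at a time."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    W_fused = [[w * sc for w in row] for row, sc in zip(W, scale)]
    b_fused = [be + (bo - m) * sc for be, bo, m, sc in zip(beta, b, mean, scale)]
    return W_fused, b_fused


x = [0.5, -1.0, 2.0]
W = [[0.2, -0.1, 0.4], [1.0, 0.3, -0.5]]
b = [0.1, -0.2]
mean, var, gamma, beta, eps = [0.3, -0.1], [1.5, 0.8], [1.1, 0.9], [0.05, -0.02], 1e-5

reference = batchnorm(conv1x1(x, W, b), mean, var, gamma, beta, eps)
W_f, b_f = fuse(W, b, mean, var, gamma, beta, eps)
fused = conv1x1(x, W_f, b_f)
max_diff = max(abs(r - f) for r, f in zip(reference, fused))
print("max abs diff:", max_diff)
```

<p>The difference should be at floating-point rounding level. A larger gap usually means the BN layer was still in training mode or the wrong axis was scaled.</p>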
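<h2 id="五int8-量化性能提升背后的精度账">5. INT8 Quantization: The Accuracy Bill Behind the Speedup</h2>
<p>On most embedded NPUs the high-performance path is INT8. FP16 and BF16 support is becoming more common, but on low-power devices INT8 remains the most cost-effective deployment option. The core of quantization is mapping floating-point tensors onto an integer range:</p>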
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">real_value ≈ scale × (int_value - zero_point)
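</span></span></code></pre></div><p>Weights are commonly quantized per-channel and activations per-tensor. Per-channel quantization significantly improves the accuracy of convolution weights because the weight distributions of different output channels can differ widely. Activations depend on a calibration dataset: a batch of representative inputs is used to collect the value range of each layer.</p>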
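<p>The mapping above can be sketched as a per-tensor asymmetric quantizer. The helper names, the signed INT8 range, and the sample value range are assumptions for illustration. Real toolchains differ in rounding mode and in whether they use symmetric or asymmetric schemes.</p>

```python
def quant_params(rmin, rmax, qmin=-128, qmax=127):
    """Per-tensor asymmetric parameters; the range is widened to include 0."""
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin) or 1.0   # guard against an all-zero tensor
    zero_point = round(qmin - rmin / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))                 # saturate instead of wrapping


def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)


scale, zp = quant_params(-1.0, 3.0)
for x in (-1.0, 0.0, 0.7, 3.0, 5.0):               # 5.0 lies outside the range and saturates
    q = quantize(x, scale, zp)
    print(x, q, round(dequantize(q, scale, zp), 4))
```

<p>Values inside the calibrated range come back with an error of at most half a quantization step, while values outside it are clipped. This is why a calibration range that misses real-world extremes hurts accuracy in exactly those scenes.</p>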
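<p>A calibration set is not just a few dozen arbitrary images. It should cover the brightness, viewing angles, backgrounds, object scales, and noise of the real scene. For industrial inspection, include good parts, minor defects, severe defects, and empty-fixture images; for voice wake-up, cover different speakers, distances, noise conditions, and microphone gains. A biased calibration set yields biased quantization scales, which shows up as errors that suddenly grow in certain scenes.</p>
<p>Below is a small script sketch for selecting an image calibration set: it buckets the raw data by brightness and edge density so that the calibration samples do not all come from the same lighting condition.</p>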
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">cv2</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">items</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">&#34;calib_raw&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">&#34;*.jpg&#34;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">img</span> <span class="o">=</span> <span class="n">cv2</span><span class="o">.</span><span class="n">imread</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">p</span><span class="p">),</span> <span class="n">cv2</span><span class="o">.</span><span class="n">IMREAD_GRAYSCALE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">img</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">continue</span>
</span></span><span class="line"><span class="cl">    <span class="n">brightness</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">img</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">    <span class="n">edges</span> <span class="o">=</span> <span class="n">cv2</span><span class="o">.</span><span class="n">Canny</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="mi">80</span><span class="p">,</span> <span class="mi">160</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">edge_density</span> <span class="o">=</span> <span class="nb">float</span><span class="p">((</span><span class="n">edges</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">    <span class="n">items</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">p</span><span class="p">,</span> <span class="n">brightness</span><span class="p">,</span> <span class="n">edge_density</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Simple bucketing: 5 brightness buckets x 5 texture buckets, at most 4 images per bucket</span>
</span></span><span class="line"><span class="cl"><span class="n">selected</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl"><span class="n">buckets</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">p</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">e</span> <span class="ow">in</span> <span class="n">items</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">key</span> <span class="o">=</span> <span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">b</span> <span class="o">/</span> <span class="mi">51</span><span class="p">),</span> <span class="mi">4</span><span class="p">),</span> <span class="nb">min</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">e</span> <span class="o">/</span> <span class="mf">0.04</span><span class="p">),</span> <span class="mi">4</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">buckets</span><span class="o">.</span><span class="n">setdefault</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="p">[])</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">paths</span> <span class="ow">in</span> <span class="n">buckets</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="n">selected</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">paths</span><span class="p">[:</span><span class="mi">4</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="s2">&#34;selected&#34;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">selected</span><span class="p">))</span>
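</span></span></code></pre></div><p>Always run a layer-by-layer comparison after quantization. The final mAP or accuracy alone rarely pinpoints the problem. It is more effective to dump several key layers from both the floating-point and INT8 models and compute cosine similarity, mean squared error, and maximum error. If similarity drops sharply starting at some layer, focus on the activation range around that layer, on abnormal outliers, and on operations that quantize poorly.</p>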
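<p>The comparison itself needs very little code. The sketch below assumes the layer outputs have already been dumped, dequantized to float, and flattened into lists keyed by layer name in execution order. The layer names, values, and the 0.99 threshold are invented for the example.</p>

```python
import math


def compare(ref, test):
    """Cosine similarity, MSE and max abs error between two flattened tensors."""
    dot = sum(r * t for r, t in zip(ref, test))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(t * t for t in test))
    cos = dot / norm if norm > 0 else (1.0 if list(ref) == list(test) else 0.0)
    errs = [abs(r - t) for r, t in zip(ref, test)]
    mse = sum(e * e for e in errs) / len(errs)
    return cos, mse, max(errs)


def first_bad_layer(fp_outputs, int8_outputs, cos_thr=0.99):
    """Name of the first layer (in dict order) whose cosine similarity drops."""
    for name, ref in fp_outputs.items():
        cos, _, _ = compare(ref, int8_outputs[name])
        if cos_thr > cos:
            return name
    return None


# Made-up dumps: dequantized INT8 outputs next to the float reference
fp = {"conv1": [0.1, 0.5, -0.3], "conv2": [1.0, -2.0, 0.5], "head": [0.2, 0.9, -0.4]}
q8 = {"conv1": [0.1, 0.49, -0.31], "conv2": [1.0, -0.4, 0.5], "head": [0.9, 0.1, 0.3]}
print(first_bad_layer(fp, q8))
```

<p>Reporting the first failing layer rather than the worst one matters, because errors propagate: every layer after the first bad one tends to look bad as well.</p>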
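<h2 id="六算子映射让模型顺着硬件走">6. Operator Mapping: Let the Model Follow the Hardware</h2>
<p>The NPU compiler splits the model graph into one or more subgraphs. Consecutive supported operators go into an NPU subgraph, and unsupported ones are left to the CPU or DSP. The goal of performance optimization is to keep large stretches of computation on the NPU without interruption.</p>
<p>A common example is <code>Conv -&gt; BN -&gt; SiLU -&gt; Add</code>. If an NPU does not support SiLU fusion, the compiler may place <code>Conv + BN</code> on the NPU, SiLU on the CPU, and Add back on the NPU. The intermediate feature map then has to be written from NPU memory to DDR, read and processed by the CPU, and written back for the NPU, which is very expensive. In that case, consider replacing SiLU with HardSwish or ReLU6 and retraining or fine-tuning the model, or train from the start with an activation function that the deployment target supports better.</p>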
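<p>To see why HardSwish is a plausible stand-in for SiLU and why the swap still changes the model, the two functions can be compared directly. This is only a numerical illustration using the standard definitions. The sampling range of [-6, 6] is an arbitrary choice.</p>

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def hardswish(x):
    # x * ReLU6(x + 3) / 6: only add, clamp and multiply, no exponential
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0


xs = [i / 10 for i in range(-60, 61)]
worst = max(xs, key=lambda x: abs(silu(x) - hardswish(x)))
gap = abs(silu(worst) - hardswish(worst))
print(f"largest gap on [-6, 6]: {gap:.3f} at x = {worst}")
```

<p>The curves are close but not identical, so swapping the activation in an already trained network shifts its outputs, and the accuracy should be re-validated after fine-tuning.</p>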
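<p>Depthwise separable convolution is another example. Its theoretical compute cost is low, but it is not necessarily faster on some NPUs, because depthwise convolution reuses data less than regular convolution and easily becomes bandwidth-bound. The 1×1 pointwise convolutions common in mobile networks are actually easier to run at full utilization on the matrix array. So model structure should not be chosen on FLOPs alone; check it against the target hardware's per-operator efficiency table.</p>
<p>In practice you can maintain an operator whitelist:</p>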
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">preferred_ops</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">Conv</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">Relu</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">LeakyRelu</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">Add</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">Mul</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">MaxPool</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">AveragePool</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">GlobalAveragePool</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">MatMul</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">avoid_ops</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">NonMaxSuppression</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">DynamicQuantizeLinear</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">GatherND</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">GridSample</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">Resize(mode=cubic)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="l">Transpose(large_feature_map)</span><span class="w">
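</span></span></span></code></pre></div><p>This table is not fixed; update it for each chip SDK. After every SDK upgrade, rerun the model conversion report to see whether nodes that used to fall back to the CPU are now supported, or whether previously supported fusions have changed.</p>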
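<p>A whitelist like this can also drive a quick pre-check before running the vendor compiler. The sketch below treats the model as a linear, topologically ordered list of op types, which ignores branches, and estimates how it would split into NPU and CPU segments. The op sequence is invented, and the supported set simply mirrors the <code>preferred_ops</code> list above.</p>

```python
from itertools import groupby

PREFERRED = {"Conv", "Relu", "LeakyRelu", "Add", "Mul", "MaxPool",
             "AveragePool", "GlobalAveragePool", "MatMul"}


def partition(op_types, supported=PREFERRED):
    """Group an ordered op list into alternating NPU / CPU segments."""
    return [("NPU" if on_npu else "CPU", list(run))
            for on_npu, run in groupby(op_types, key=lambda op: op in supported)]


def coverage(op_types, supported=PREFERRED):
    """Share of ops that would run on the NPU."""
    return sum(op in supported for op in op_types) / len(op_types)


ops = ["Conv", "Relu", "Conv", "Sigmoid", "Mul", "Conv", "Add", "NonMaxSuppression"]
segments = partition(ops)
for device, run in segments:
    print(device, run)
print(f"coverage = {coverage(ops):.0%}, device switches = {len(segments) - 1}")
```

<p>The number of device switches is often the more telling figure: a single unsupported op in the middle of the backbone costs two extra transfers of a large feature map, while the same op at the tail of the graph costs almost nothing.</p>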
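<h2 id="七内存规划嵌入式-npu-最容易被低估的战场">7. Memory Planning: The Most Underestimated Battleground of Embedded NPUs</h2>
<p>Many models have a modest theoretical compute cost yet run slowly on the board, and the root cause is memory. On an embedded SoC, DDR serves the CPU, GPU, ISP, VPU, NPU, and display controller at the same time. If camera preview, video encoding, and neural network inference run concurrently, DDR bandwidth is exhausted quickly. However high the NPU's peak compute, it idles when data cannot be fed fast enough.</p>
<p>When optimizing memory, look at three kinds of data first: inputs and outputs, intermediate feature maps, and weights. Weights can usually stay resident in memory, and some platforms can preload them into a dedicated region. Inputs and outputs depend on the application pipeline: does camera NV12 data need conversion to RGB, does it need a resize, and can zero-copy be used? Intermediate feature maps are planned by the compiler, but the model structure affects peak memory.</p>
<p>For vision models, preprocessing is often overlooked. With a 1080p camera input, if the CPU performs NV12-to-RGB conversion, resize, and normalization on every frame and then copies the result into the NPU input buffer, preprocessing alone can take 8 to 15 ms. A better approach is to use the ISP, RGA, or GPU for color conversion and scaling, or to have the NPU runtime accept hardware buffer handles so that memory copies are reduced.</p>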
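<p>A rough estimate of the memory traffic shows why the CPU path is expensive. The sketch below counts only bytes read and written, under a simplified sequence assumed for illustration: read NV12, write full-size RGB, read it back for the resize, write the resized RGB, then copy that into the NPU input buffer. The 640×640 model input and 30 fps are example values, and normalization and cache effects are ignored.</p>

```python
def nv12_bytes(w, h):
    """NV12: full-resolution Y plane plus half-resolution interleaved UV (12 bits/pixel)."""
    return w * h * 3 // 2


def rgb_bytes(w, h):
    return w * h * 3


def cpu_path_traffic(src_w, src_h, dst_w, dst_h, fps):
    """Approximate bytes per second touched by CPU preprocessing."""
    per_frame = (nv12_bytes(src_w, src_h)        # read the camera frame
                 + 2 * rgb_bytes(src_w, src_h)   # write full-size RGB, read it for resize
                 + 3 * rgb_bytes(dst_w, dst_h))  # write resized RGB, then memcpy (read + write)
    return per_frame * fps


traffic = cpu_path_traffic(1920, 1080, 640, 640, fps=30)
print(f"{traffic / 1e6:.0f} MB/s of DDR traffic for preprocessing alone")
```

<p>With hardware scaling and a shared buffer, most of these passes disappear, leaving the read of the source frame and one write of the resized image.</p>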
<p>A typical zero-copy pipeline looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Camera DMA buffer
</span></span><span class="line"><span class="cl">  -&gt; hardware scaling / color conversion
</span></span><span class="line"><span class="cl">  -&gt; NPU input buffer
</span></span><span class="line"><span class="cl">  -&gt; NPU inference
</span></span><span class="line"><span class="cl">  -&gt; CPU reads the small output
</span></span><span class="line"><span class="cl">  -&gt; postprocessing and application logic
</span></span></code></pre></div><p>If the platform supports ION, DMA-BUF, or a similar mechanism, pass video frames between hardware blocks as handles instead of repeatedly calling <code>memcpy</code> in user space. This kind of optimization is less visible than changing the model, but it is very effective for end-to-end latency and power.</p>
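<h2 id="八运行时流水线不要让-npu-等-cpu">8. Runtime Pipeline: Do Not Let the NPU Wait for the CPU</h2>
<p>Single-frame inference usually consists of capture, preprocessing, inference, postprocessing, and display or communication. If these steps run serially, total latency is the sum of all of them. A real product can run them as a parallel pipeline: while the CPU postprocesses frame N, the NPU infers frame N+1 and the ISP prepares frame N+2. Per-frame latency does not go away, but system throughput improves noticeably.</p>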
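<p>The difference between latency and throughput can be put into numbers. The stage times below are invented, and the model assumes each stage runs on its own hardware unit or thread with no queueing overhead, which is the ideal case.</p>

```python
stages_ms = {"capture": 5.0, "preprocess": 4.0, "infer": 18.0, "postprocess": 6.0}

serial_latency = sum(stages_ms.values())        # one frame through every stage
serial_fps = 1000.0 / serial_latency            # next frame starts only when this one ends

bottleneck = max(stages_ms, key=stages_ms.get)
pipelined_fps = 1000.0 / stages_ms[bottleneck]  # steady state: limited by the slowest stage

print(f"serial:    {serial_latency:.0f} ms per frame, {serial_fps:.1f} fps")
print(f"pipelined: bottleneck = {bottleneck}, {pipelined_fps:.1f} fps")
```

<p>In the pipelined case each frame still takes at least the sum of the stage times to travel through the system. Only the rate at which results come out improves.</p>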
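<p>The following simplified C++ pseudocode shows the structure of a four-thread pipeline (capture, preprocessing, inference, and postprocessing) connected by three queues:</p>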
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="k">struct</span> <span class="nc">FrameJob</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">id</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">Buffer</span> <span class="n">input</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">Buffer</span> <span class="n">tensor</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">NpuOutput</span> <span class="n">output</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">BlockingQueue</span><span class="o">&lt;</span><span class="n">FrameJob</span><span class="o">&gt;</span> <span class="n">q_pre</span><span class="p">,</span> <span class="n">q_infer</span><span class="p">,</span> <span class="n">q_post</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">capture_thread</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="n">running</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">FrameJob</span> <span class="n">job</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="n">job</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">id</span><span class="o">++</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="n">job</span><span class="p">.</span><span class="n">input</span> <span class="o">=</span> <span class="n">camera_dequeue</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="n">q_pre</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">job</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">preprocess_thread</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="n">running</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">auto</span> <span class="n">job</span> <span class="o">=</span> <span class="n">q_pre</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="n">job</span><span class="p">.</span><span class="n">tensor</span> <span class="o">=</span> <span class="n">hw_resize_color_convert</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">input</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">q_infer</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">job</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">infer_thread</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="n">running</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">auto</span> <span class="n">job</span> <span class="o">=</span> <span class="n">q_infer</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="n">job</span><span class="p">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">npu_run</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">tensor</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">q_post</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">job</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">postprocess_thread</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="n">running</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">auto</span> <span class="n">job</span> <span class="o">=</span> <span class="n">q_post</span><span class="p">.</span><span class="n">pop</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="k">auto</span> <span class="n">result</span> <span class="o">=</span> <span class="n">decode_and_nms</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">output</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">publish_result</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">result</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">camera_release</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">input</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
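</span></span></code></pre></div><p>A few engineering details matter here. Queue length must be bounded, otherwise a real-time system piles up stale frames. If the application only cares about the latest result, drop old frames when the queue is full. NPU input and output buffers should be reused rather than allocated and freed on every frame. Cache flush/invalidate also needs attention as buffers pass between threads, especially when the CPU and NPU share physical memory.</p>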
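<p>The "bounded queue that drops old frames" policy can be sketched as follows. This is a Python illustration of the idea rather than a port of the <code>BlockingQueue</code> used in the pseudocode. The class name <code>LatestQueue</code> and the <code>on_drop</code> callback are hypothetical.</p>

```python
import threading
from collections import deque


class LatestQueue:
    """Bounded queue that drops the oldest item when full."""

    def __init__(self, maxlen, on_drop=None):
        self._items = deque()
        self._maxlen = maxlen
        self._on_drop = on_drop          # e.g. return the frame buffer to the camera
        self._cond = threading.Condition()
        self.dropped = 0

    def push(self, item):
        with self._cond:
            if len(self._items) >= self._maxlen:
                old = self._items.popleft()
                self.dropped += 1
                if self._on_drop:
                    self._on_drop(old)
            self._items.append(item)
            self._cond.notify()

    def pop(self, timeout=None):
        """Block until an item is available; return None on timeout."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._items, timeout):
                return None
            return self._items.popleft()


q = LatestQueue(maxlen=2)
for frame_id in range(5):
    q.push(frame_id)
print(q.pop(), q.pop(), q.dropped)
```

<p>The drop callback is the important part in a real pipeline: a dropped frame still owns a camera buffer, and forgetting to release it eventually starves the capture stage.</p>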
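<h2 id="九性能排查按层耗时比总耗时更重要">9. Performance Troubleshooting: Per-Layer Time Matters More Than Total Time</h2>
<p>When a model runs slowly, do not guess first. The first step is to get the conversion report and profiler output. Most NPU SDKs can report per-layer time, subgraph partitioning, memory usage, and CPU fallback. If the tooling is weak, wrap preprocessing, inference, and postprocessing with timestamps on the application side to at least establish the general direction.</p>
<p>A recommended troubleshooting order:</p>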
<ol>
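<li><strong>Check clock frequency and temperature</strong>: is power supply, cooling, or the governor causing throttling?</li>
<li><strong>Check the model input size</strong>: is a larger resolution or a dynamic shape being used by mistake?</li>
<li><strong>Check NPU subgraph coverage</strong>: is there any CPU fallback?</li>
<li><strong>Look at the 10 slowest layers</strong>: are they large convolutions, Resize, Transpose, or postprocessing?</li>
<li><strong>Check preprocessing time</strong>: is it being slowed down by CPU resize and memcpy?</li>
<li><strong>Check the quantization mode</strong>: are some layers still FP32, raising the cost of mixed execution?</li>
<li><strong>Compare against the official benchmark</strong>: use the vendor example to rule out a systemic environment problem.</li>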
</ol>
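<p>If a convolution layer is abnormally slow, try changing the input size, channel count, or model structure. NPUs are sensitive to channel alignment, and some hardware prefers channel counts that are multiples of 8, 16, or 32. Irregular channel counts such as 3, 5, or 7 may force the compiler to pad, which increases both actual computation and memory. Aligning channel counts to the hardware granularity at training time is usually cheaper than forcing optimizations after deployment.</p>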
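<p>The cost of padding is easy to quantify. The sketch below assumes a hardware granularity of 16 channels, which is only an example; the real value comes from the chip documentation.</p>

```python
def align_up(value, multiple):
    """Round value up to the next multiple."""
    return (value + multiple - 1) // multiple * multiple


def padding_overhead(channels, granularity):
    """Fraction of extra channels the hardware processes because of padding."""
    padded = align_up(channels, granularity)
    return (padded - channels) / channels


for c in (3, 24, 40, 64):
    print(c, align_up(c, 16), f"{padding_overhead(c, 16):.0%}")
```

<p>A 3-channel input layer is the extreme case, but it is usually a small share of the whole network. Widths such as 24 or 40 in the middle of a backbone are where padding quietly adds up.</p>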
<h2 id="十一个可落地的部署检查清单">10. A Practical Deployment Checklist</h2>
<p>The checklist below is suitable for a project's <code>docs/deployment.md</code>; go through it step by step on every model upgrade:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">[Model export]
</span></span><span class="line"><span class="cl">- Fix the input shape and batch=1
</span></span><span class="line"><span class="cl">- Disable training-mode nodes and confirm BN is in eval mode
</span></span><span class="line"><span class="cl">- Validate the exported ONNX with onnx.checker
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Graph optimization]
</span></span><span class="line"><span class="cl">- Count operator types and flag Transpose / Gather / Slice / Resize
</span></span><span class="line"><span class="cl">- Fuse Conv + BN + Activation
</span></span><span class="line"><span class="cl">- Move NMS, decode, and other postprocessing out of the NPU graph
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Quantization]
</span></span><span class="line"><span class="cl">- Calibration set covers real scenes
</span></span><span class="line"><span class="cl">- Use per-channel weight quantization
</span></span><span class="line"><span class="cl">- Run a layer-by-layer cosine similarity comparison
</span></span><span class="line"><span class="cl">- Run regression tests on key metrics
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Compilation]
</span></span><span class="line"><span class="cl">- Check NPU subgraph coverage
</span></span><span class="line"><span class="cl">- Save the compile report and SDK version
</span></span><span class="line"><span class="cl">- Record input layout, quantization parameters, and runtime configuration
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">[Runtime]
</span></span><span class="line"><span class="cl">- Preallocate input and output buffers
</span></span><span class="line"><span class="cl">- Use hardware preprocessing or zero-copy
</span></span><span class="line"><span class="cl">- Separate capture, inference, and postprocessing threads
</span></span><span class="line"><span class="cl">- Record end-to-end latency, P95, P99, and temperature
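</span></span></code></pre></div><p>The value of this checklist is that it makes deployment reproducible. Many teams do not lack optimization skills; their problem is that the model, SDK, and board-level system all change at once, and in the end nobody knows where a performance change came from. Saving the conversion commands, calibration data version, compile reports, and test results makes troubleshooting far more efficient.</p>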
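<p>For the latency item in the runtime section, a small helper is enough to turn raw per-frame timings into the numbers worth recording. The nearest-rank percentile definition used here is one common choice among several, and the latency samples are synthetic.</p>

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms):
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Synthetic run: mostly steady frames plus a slow tail (e.g. after thermal throttling)
latencies = [20.0] * 90 + [25.0] * 6 + [35.0, 48.0, 60.0, 90.0]
print(summarize(latencies))
```

<p>The mean barely moves when a few frames are slow, which is why the tail percentiles belong in the report alongside it.</p>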
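<h2 id="十一常见坑与解决办法">11. Common Pitfalls and Fixes</h2>
<p><strong>1. ONNX inference is correct on the PC, but NPU results are clearly off.</strong> First check the input normalization order, RGB/BGR, NHWC/NCHW, and quantization scale. Many accuracy problems are not NPU computation errors but preprocessing that is inconsistent with training.</p>
<p><strong>2. The conversion tool reports an unsupported operator.</strong> Do not rush to switch chips. First see whether the operator can be constant-folded, replaced, or moved to postprocessing. For example, <code>NonMaxSuppression</code> in a detection model usually does not need to be in the NPU graph.</p>
<p><strong>3. INT8 accuracy drops sharply.</strong> Increase calibration set diversity, inspect layers with abnormal activations, and try per-channel quantization, mixed precision, or quantization-aware training. Sensitive parts such as detection heads and attention modules can stay in FP16, provided the hardware supports it and performance is acceptable.</p>
<p><strong>4. The standalone benchmark is fast, but the application is slow.</strong> Focus on preprocessing, postprocessing, thread synchronization, log printing, and memory copies. Vendor benchmarks often measure only the NPU subgraph and exclude the camera and application logic.</p>
<p><strong>5. Performance degrades after long runs.</strong> Check temperature, power supply, memory leaks, and buffer queue buildup. For embedded devices, sustained performance matters more than cold-start results.</p>
<h2 id="十二总结npu-优化是模型编译器和系统工程的合题">12. Summary: NPU Optimization Is a Combined Problem of Model, Compiler, and Systems Engineering</h2>
<p>Embedded NPU deployment is not as simple as dropping a model into a conversion tool. A stable, efficient solution needs a model structure that follows the hardware, quantization calibration that matches real data, compile reports that explain every subgraph, a runtime pipeline that minimizes waiting, and system-level control of bandwidth, power, and temperature.</p>
<p>Compressed into one sentence: first keep operators running contiguously on the NPU, then reduce movement of large feature maps, and only then pursue the last bit of performance from individual operators. In many projects the real gains come from deleting a few redundant <code>Transpose</code> nodes, moving NMS out of the graph, using hardware resize, and reusing DMA buffers, rather than from blindly chasing higher TOPS.</p>
<p>Looking ahead, edge NPUs will keep moving toward stronger mixed precision, better Transformer support, and more mature software stacks. For the foreseeable future, though, embedded engineers still need to understand hardware constraints. People who understand models write networks that are easier to deploy, people who understand systems can fit the NPU into a real product pipeline, and people who understand architecture can make the right trade-offs among performance, power, and cost.</p>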
]]></content:encoded>
    </item>
  </channel>
</rss>
