<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>模型量化 on Tech Snippets - 嵌入式技术笔记</title>
    <link>https://tech-snippets.xyz/tags/%E6%A8%A1%E5%9E%8B%E9%87%8F%E5%8C%96/</link>
    <description>Recent content in 模型量化 on Tech Snippets - 嵌入式技术笔记</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Sun, 10 May 2026 19:00:00 +0800</lastBuildDate>
    <atom:link href="https://tech-snippets.xyz/tags/%E6%A8%A1%E5%9E%8B%E9%87%8F%E5%8C%96/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>YOLOv8 边缘设备部署与性能优化实战指南</title>
      <link>https://tech-snippets.xyz/posts/yolov8-edge-deployment-optimization-guide/</link>
      <pubDate>Sun, 10 May 2026 19:00:00 +0800</pubDate>
      <guid>https://tech-snippets.xyz/posts/yolov8-edge-deployment-optimization-guide/</guid>
      <description>前言 2026 年，AI 算力正在经历一场深刻的范式转移。
当所有人都在追捧千亿参数大模型的时候，另一股更接地气的力量正在悄然壮大——边缘 AI。根据 IDC 的预测，到 2027 年，超过 50% 的数据处理将在边缘侧完成，而不是集中在云端数据中心。
这股趋势在计算机视觉领域表现得尤为明显。安防摄像头、工业检测设备、智能驾驶辅助系统、服务机器人……这些场景对目标检测算法要求低延迟、高可靠性和隐私安全，而这些恰恰是云端推理难以满足的痛点：
延迟问题：云端推理往返延迟通常在 100ms 以上，无法满足实时检测需求 带宽成本：4K 视频流码率约 10Mbps，24 小时上传超过 100GB 隐私安全：敏感场景不允许视频流离开设备 断网运行：工业场景必须支持离线工作 于是，如何在算力有限的边缘芯片上跑起 YOLO，就成了嵌入式 AI 工程师的核心课题。
YOLOv8 作为 Ultralytics 推出的新一代检测模型，在精度和速度上达到了新的平衡，但默认导出的 PyTorch 模型在边缘设备上根本跑不起来——300+MB 的内存占用、100ms+ 的推理时间，完全无法满足产品级要求。
本文将带你从零开始，完整走完 YOLOv8 从训练好的 .pt 模型到边缘设备部署的全过程：ONNX 导出、NCNN 转换、INT8 量化、NEON 优化，最终在树莓派 5 上达到 25 FPS 的实时检测速度。
（图：YOLOv8 边缘设备部署流程）
一、为什么边缘 AI 是未来？ 1.1 云计算的天花板 很多初学者有一个常见的误区：&amp;ldquo;既然云端算力这么强，为什么不直接把视频传到云端做检测？&amp;rdquo;
我在某智能安防项目踩过这个坑。一开始方案很简单：摄像头 RTSP 流拉流 → FFmpeg 编码 → HTTP 上传 → 云端 GPU 推理 → 结果返回。</description>
      <content:encoded><![CDATA[<h2 id="前言">前言</h2>
<p>2026 年，AI 算力正在经历一场深刻的范式转移。</p>
<p>当所有人都在追捧千亿参数大模型的时候，另一股更接地气的力量正在悄然壮大——<strong>边缘 AI</strong>。根据 IDC 的预测，到 2027 年，超过 50% 的数据处理将在边缘侧完成，而不是集中在云端数据中心。</p>
<p>这股趋势在计算机视觉领域表现得尤为明显。安防摄像头、工业检测设备、智能驾驶辅助系统、服务机器人……这些场景对目标检测算法要求<strong>低延迟、高可靠性、隐私安全</strong>，而这些恰恰是云端推理难以满足的痛点：</p>
<ul>
<li><strong>延迟问题</strong>：云端推理往返延迟通常在 100ms 以上，无法满足实时检测需求</li>
<li><strong>带宽成本</strong>：4K 视频流码率约 10Mbps，24 小时上传超过 100GB</li>
<li><strong>隐私安全</strong>：敏感场景不允许视频流离开设备</li>
<li><strong>断网运行</strong>：工业场景必须支持离线工作</li>
</ul>
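<p>上面的带宽账可以用几行 Python 粗略验证一下（按 10 Mbps 码率的假设值计算）：</p>

```python
# 粗略估算一路 4K 视频流 24 小时上传的数据量（码率按 10 Mbps 假设）
bitrate_mbps = 10
seconds_per_day = 24 * 3600
total_megabits = bitrate_mbps * seconds_per_day   # 共 864000 Mb
total_gb = total_megabits / 8 / 1000              # 换算成 GB（十进制）
print(f"每路摄像头每天约 {total_gb:.0f} GB")
```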
<p>于是，如何在算力有限的边缘芯片上跑起 YOLO，就成了嵌入式 AI 工程师的核心课题。</p>
<p>YOLOv8 作为 Ultralytics 推出的新一代检测模型，在精度和速度上达到了新的平衡，但默认导出的 PyTorch 模型在边缘设备上根本跑不起来——300+MB 的内存占用、100ms+ 的推理时间，完全无法满足产品级要求。</p>
<p>本文将带你从零开始，完整走完 YOLOv8 从训练好的 .pt 模型到边缘设备部署的全过程：ONNX 导出、NCNN 转换、INT8 量化、NEON 优化，最终在树莓派 5 上达到 25 FPS 的实时检测速度。</p>
<p>（图：YOLOv8 边缘设备部署流程）</p>
<h2 id="一为什么边缘-ai-是未来">一、为什么边缘 AI 是未来？</h2>
<h3 id="11-云计算的天花板">1.1 云计算的天花板</h3>
<p>很多初学者有一个常见的误区：&ldquo;既然云端算力这么强，为什么不直接把视频传到云端做检测？&rdquo;</p>
<p>我在某智能安防项目踩过这个坑。一开始方案很简单：摄像头 RTSP 流拉流 → FFmpeg 编码 → HTTP 上传 → 云端 GPU 推理 → 结果返回。</p>
<p>理论上完美，上线后才发现问题比想象中多得多：</p>
<ul>
<li>网络抖动导致帧率不稳</li>
<li>百路摄像头同时上传时，出口带宽直接被打满</li>
<li>某个工地场景根本没有 4G/5G 信号</li>
<li>客户明确要求视频数据不能离开园区</li>
</ul>
<p>这还只是冰山一角。工业场景下，把计算推向边缘，才是解决这些问题的根本之道。</p>
<h3 id="12-边缘设备的算力谱系">1.2 边缘设备的算力谱系</h3>
<p>提到边缘设备，很多人第一反应是树莓派。但实际上边缘设备的算力跨度非常大，从几毛钱的 MCU 到几百块的 ARM SOC，再到几千块的 NPU 加速卡：</p>
<table>
<thead>
<tr>
<th>设备类型</th>
<th>典型芯片</th>
<th>算力</th>
<th>价格</th>
<th>典型帧率（YOLOv8n）</th>
</tr>
</thead>
<tbody>
<tr>
<td>通用 MCU</td>
<td>ESP32-S3</td>
<td>&lt; 1 TOPS</td>
<td>¥20</td>
<td>&lt; 1 FPS</td>
</tr>
<tr>
<td>入门级 ARM</td>
<td>Raspberry Pi 4</td>
<td></td>
<td></td>
<td>~ 1.5 FPS</td>
</tr>
<tr>
<td>高性能 ARM</td>
<td>Raspberry Pi 5</td>
<td></td>
<td></td>
<td>~ 3 FPS</td>
</tr>
<tr>
<td>NPU 加速</td>
<td>Rockchip RK3588</td>
<td>6 TOPS</td>
<td>¥600</td>
<td>~ 30 FPS</td>
</tr>
<tr>
<td>高端 NPU</td>
<td>Jetson Orin NX</td>
<td>100 TOPS</td>
<td>¥3000</td>
<td>~ 100 FPS</td>
</tr>
</tbody>
</table>
<p>可以看到，不同层级的设备，性能差了两个数量级。部署策略也完全不同：</p>
<ul>
<li><strong>MCU 级别</strong>：需要极致模型压缩到 1MB 以内，甚至需要手工优化汇编</li>
<li><strong>ARM 级别</strong>：NCNN/TFLite + NEON 优化</li>
<li><strong>NPU 级别</strong>：厂商专用推理框架，充分利用硬件加速单元</li>
</ul>
<h3 id="13-边缘推理框架选型">1.3 边缘推理框架选型</h3>
<p>目前主流的边缘推理框架有这几个：</p>
<table>
<thead>
<tr>
<th>框架</th>
<th>出品方</th>
<th>优势</th>
<th>劣势</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>NCNN</strong></td>
<td>腾讯</td>
<td>开源、轻量、NEON 优化好</td>
<td>文档相对繁琐</td>
</tr>
<tr>
<td><strong>MNN</strong></td>
<td>阿里</td>
<td>性能均衡、支持算子丰富</td>
<td>社区活跃度稍低</td>
</tr>
<tr>
<td><strong>TFLite</strong></td>
<td>Google</td>
<td>官方支持最好、工具链完整</td>
<td>大模型优化一般</td>
</tr>
<tr>
<td><strong>ONNX Runtime</strong></td>
<td>Microsoft</td>
<td>兼容性最好</td>
<td>移动端优化一般</td>
</tr>
<tr>
<td><strong>RKNN</strong></td>
<td>瑞芯微</td>
<td>NPU 硬件加速</td>
<td>仅支持瑞芯微芯片</td>
</tr>
<tr>
<td><strong>TensorRT</strong></td>
<td>NVIDIA</td>
<td>GPU 极致优化</td>
<td>仅支持 NVIDIA</td>
</tr>
</tbody>
</table>
<p>对于大多数 ARM 边缘设备，我首推 NCNN。理由很简单：</p>
<ol>
<li><strong>性能最优</strong>：针对 ARM NEON 指令集优化最彻底</li>
<li>内存占用极低，适合资源受限场景</li>
<li>社区活跃，问题解决快</li>
<li>覆盖 ARM、x86、GPU 等主流硬件平台</li>
</ol>
<p>接下来，我们就以 NCNN 为主线，一步步拆解 YOLOv8 边缘部署的全流程。</p>
<h2 id="二yolov8-架构深度解析">二、YOLOv8 架构深度解析</h2>
<p>在部署之前，我们必须先理解 YOLOv8 的网络结构。很多部署优化失败，根源在于不理解模型结构就盲目导出转换。</p>
<h3 id="21-yolo-系列的演进">2.1 YOLO 系列的演进</h3>
<p>从 YOLOv1 到 YOLOv8，检测头的变化是最核心的演进：</p>
<ul>
<li><strong>YOLOv1-v2</strong>：单检测头，单一尺度</li>
<li><strong>YOLOv3</strong>：三检测头，FPN 特征金字塔</li>
<li><strong>YOLOv4-v5</strong>：PAN 双向特征融合 + CSP 结构</li>
<li><strong>YOLOv7</strong>：重参数化结构</li>
<li><strong>YOLOv8</strong>：Anchor-Free + C2f 结构 + 解耦头</li>
</ul>
<p>YOLOv8 最大的变化有三个：</p>
<ol>
<li><strong>Anchor-Free</strong>：去掉了锚框机制，直接预测中心点偏移和宽高</li>
<li><strong>C2f 模块</strong>：借鉴了 CSPNet 和 ELAN 的思想，并行多分支结构</li>
<li><strong>解耦检测头</strong>：分类和回归分支完全分离</li>
</ol>
<p>这三个变化，都给部署带来了新的挑战，也带来了新的优化空间。</p>
<h3 id="22-yolov8-网络结构详解">2.2 YOLOv8 网络结构详解</h3>
<p>YOLOv8 的 backbone 沿用了 CSP 思想，但做了重要改进。</p>
<p>原来的 C3 模块被替换成了 C2f 模块。C3 是单分支的 Bottleneck 堆叠，而 C2f 是并行的两个分支，其中一个分支经过多个 Bottleneck，另一个分支直接 shortcut，最后 concat。</p>
<p>这种结构在保持精度的同时，计算效率更高，更利于边缘部署。</p>
<p>在 Neck 部分，YOLOv8 继续使用 PAN 结构，但同样用 C2f 模块替换了原来的 C3。</p>
<p>最关键的是检测头部分，YOLOv8 采用了完全解耦的检测头：</p>
<pre tabindex="0"><code>输入特征 → 共享卷积 ─→ 分类分支（独立卷积层）
                  └→ 回归分支（独立卷积层）
</code></pre><p>分类和回归完全分离，各自有独立的卷积层。这意味着在 NCNN 部署时，我们需要分别处理这两个分支的输出，而不是像以前那样处理一个综合的输出张量。</p>
<h3 id="23-模型尺寸选择">2.3 模型尺寸选择</h3>
<p>YOLOv8 提供了五个尺寸的模型：</p>
<table>
<thead>
<tr>
<th>模型</th>
<th>参数量</th>
<th>mAP@0.5</th>
<th>推理速度（GPU）</th>
</tr>
</thead>
<tbody>
<tr>
<td>YOLOv8n</td>
<td>3.2M</td>
<td>37.3</td>
<td>~1ms</td>
</tr>
<tr>
<td>YOLOv8s</td>
<td>11.2M</td>
<td>44.9</td>
<td>~1.5ms</td>
</tr>
<tr>
<td>YOLOv8m</td>
<td>25.9M</td>
<td>50.2</td>
<td>~3ms</td>
</tr>
<tr>
<td>YOLOv8l</td>
<td>43.7M</td>
<td>52.9</td>
<td>~5ms</td>
</tr>
<tr>
<td>YOLOv8x</td>
<td>68.2M</td>
<td>53.9</td>
<td>~10ms</td>
</tr>
</tbody>
</table>
<p>在边缘设备上，<strong>v8n 和 v8s 是唯二可行的选择</strong>。v8m 以上的 25M+ 参数量在 ARM 上推理时间会超过 100ms，基本无法实时。</p>
<p>我的经验是：如果你的场景能接受 v8n 的精度，就绝对不要用 v8s。部署优化的难度，v8s 是 v8n 的两倍，而精度提升是有限的。</p>
<h2 id="三onnx-导出与模型准备">三、ONNX 导出与模型准备</h2>
<h3 id="31-为什么必须导出-onnx">3.1 为什么必须导出 ONNX</h3>
<p>PyTorch 的动态图机制非常灵活，但灵活意味着运行时的不确定性，这对部署来说是噩梦。</p>
<p>ONNX（Open Neural Network Exchange）是微软和 Facebook 联合推出的开放式神经网络交换格式。它的作用是把 PyTorch/TensorFlow 等训练框架的模型，转换成一个统一的中间表示，然后推理框架可以基于这个中间表示做硬件优化。</p>
<p>导出 ONNX，是所有边缘部署的第一步，也是最容易出问题的一步。</p>
<h3 id="32-yolov8-官方导出">3.2 YOLOv8 官方导出</h3>
<p>Ultralytics 提供了一键导出功能：</p>
<pre tabindex="0"><code class="language-python">from ultralytics import YOLO

model = YOLO(&#39;yolov8n.pt&#39;)
model.export(format=&#39;onnx&#39;, opset=12, simplify=True)
</code></pre><p>看起来很简单，但这里面坑非常多。</p>
<h3 id="33-导出参数详解">3.3 导出参数详解</h3>
<p>下面详解几个关键参数：</p>
<p><strong>opset</strong>：ONNX 算子集版本，也是最容易出问题的参数。opset 11 是目前兼容性最好的版本，但 YOLOv8 需要 opset 12 才能完整支持所有算子。NCNN 对高版本 opset 支持不完善，所以 opset=12 是目前的最佳选择。</p>
<p><strong>simplify</strong>：是否使用 onnxsim 优化。这个必须开！不开的话，导出的 ONNX 模型会有很多冗余算子，NCNN 转换时会出现大量不支持的算子，甚至转换失败。</p>
<p><strong>dynamic</strong>：是否支持动态输入尺寸。边缘部署时，我们通常关闭动态输入，固定输入尺寸，这样推理框架可以做更多优化。</p>
<p><strong>imgsz</strong>：输入尺寸。640 是默认值，但可以根据场景调整。输入越小，速度越快，精度越低。我的经验是，对于大多数检测任务，416x416 是精度和速度的最佳平衡点。</p>
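<p>imgsz 对速度的影响可以粗略估算：卷积计算量大致与输入面积成正比。下面是一个按比例的估算示意（非实测数据）：</p>

```python
# 卷积 FLOPs 大致与 H*W 成正比，估算不同 imgsz 相对 640 输入的计算量
base = 640
for size in (640, 512, 416, 320):
    ratio = (size / base) ** 2
    print(f"imgsz={size}: 约为 640 输入的 {ratio:.0%} 计算量")
```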
<h3 id="34-常见导出脚本">3.4 常见导出脚本</h3>
<p>我推荐使用这个更稳妥的导出脚本：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">ultralytics</span> <span class="kn">import</span> <span class="n">YOLO</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">onnx</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">onnxsim</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 加载模型</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">YOLO</span><span class="p">(</span><span class="s1">&#39;yolov8n.pt&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 导出默认参数设置</span>
</span></span><span class="line"><span class="cl"><span class="n">input_shape</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">640</span><span class="p">,</span> <span class="mi">640</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 导出 ONNX</span>
</span></span><span class="line"><span class="cl"><span class="n">model</span><span class="o">.</span><span class="n">export</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nb">format</span><span class="o">=</span><span class="s1">&#39;onnx&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">opset</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">simplify</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>  <span class="c1"># 先不简化，手动处理</span>
</span></span><span class="line"><span class="cl">    <span class="n">dynamic</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">batch</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">imgsz</span><span class="o">=</span><span class="mi">640</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 手动简化模型</span>
</span></span><span class="line"><span class="cl"><span class="n">onnx_model</span> <span class="o">=</span> <span class="n">onnx</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s1">&#39;yolov8n.onnx&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 检查模型</span>
</span></span><span class="line"><span class="cl"><span class="n">onnx</span><span class="o">.</span><span class="n">checker</span><span class="o">.</span><span class="n">check_model</span><span class="p">(</span><span class="n">onnx_model</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 简化</span>
</span></span><span class="line"><span class="cl"><span class="n">model_simp</span><span class="p">,</span> <span class="n">check</span> <span class="o">=</span> <span class="n">onnxsim</span><span class="o">.</span><span class="n">simplify</span><span class="p">(</span><span class="n">onnx_model</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">assert</span> <span class="n">check</span><span class="p">,</span> <span class="s2">&#34;Simplified ONNX model could not be validated&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 保存简化后的模型</span>
</span></span><span class="line"><span class="cl"><span class="n">onnx</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">model_simp</span><span class="p">,</span> <span class="s1">&#39;yolov8n-sim.onnx&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p>这个脚本做了几件事：</p>
<ol>
<li>先导出基础 ONNX 文件</li>
<li>用 onnx.checker 检查模型有效性</li>
<li>手动调用 onnxsim 进行简化</li>
<li>再次验证简化后的模型</li>
</ol>
<p>这样可以最大限度地避免自动导出的问题。</p>
<h2 id="四ncnn-转换与模型优化">四、NCNN 转换与模型优化</h2>
<p>ONNX 导出完成只是第一步，接下来要转换成 NCNN 能直接加载的格式。</p>
<h3 id="41-编译-ncnn-工具链">4.1 编译 NCNN 工具链</h3>
<p>首先需要编译 NCNN 的转换工具：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># 克隆源码</span>
</span></span><span class="line"><span class="cl">git clone https://github.com/Tencent/ncnn.git
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> ncnn
</span></span><span class="line"><span class="cl">mkdir build <span class="o">&amp;&amp;</span> <span class="nb">cd</span> build
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 编译（开启 ONNX 支持）</span>
</span></span><span class="line"><span class="cl">cmake -DNCNN_BUILD_TOOLS<span class="o">=</span>ON -DNCNN_VULKAN<span class="o">=</span>OFF ..
</span></span><span class="line"><span class="cl">make -j8
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 编译好的工具在 tools/onnx/ 目录下</span>
</span></span></code></pre></div><p>对于树莓派等 ARM 设备，需要交叉编译或者直接在设备上编译：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># 使用 toolchain 文件交叉编译树莓派版本</span>
</span></span><span class="line"><span class="cl">cmake -DCMAKE_TOOLCHAIN_FILE<span class="o">=</span>../toolchains/pi3.toolchain.cmake ..
</span></span></code></pre></div><h3 id="42-onnx-转-ncnn">4.2 ONNX 转 NCNN</h3>
<p>转换命令非常简单：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">./onnx2ncnn yolov8n-sim.onnx yolov8n.param yolov8n.bin
</span></span></code></pre></div><p>这个命令会生成两个文件：</p>
<ul>
<li><strong>yolov8n.param</strong>：网络结构描述文件，文本格式</li>
<li><strong>yolov8n.bin</strong>：权重参数文件，二进制格式</li>
</ul>
<p><strong>重点来了</strong>：YOLOv8 的转换几乎每次都会遇到问题。最常见的就是：</p>
<pre tabindex="0"><code>Unsupported slice with step!
</code></pre><p>这是因为 YOLOv8 的检测头用了很多 StridedSlice 算子，而 NCNN 对某些特殊步长的 Slice 支持有限。</p>
<p>遇到这个问题不要慌，有两种解决方法：</p>
<p><strong>方法一：降级 opset 到 11</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">model</span><span class="o">.</span><span class="n">export</span><span class="p">(</span><span class="nb">format</span><span class="o">=</span><span class="s1">&#39;onnx&#39;</span><span class="p">,</span> <span class="n">opset</span><span class="o">=</span><span class="mi">11</span><span class="p">,</span> <span class="o">...</span><span class="p">)</span>
</span></span></code></pre></div><p>opset 11 的 Slice 算子表示方式不同，NCNN 支持更好。</p>
<p><strong>方法二：手动修改 param 文件</strong></p>
<p>如果转换成功但推理结果不对，大概率是 Reshape 或者 Permute 层的顺序问题。这时候需要手动编辑 param 文件调整。</p>
<h3 id="43-ncnn-模型优化">4.3 NCNN 模型优化</h3>
<p>转换后还需要进行模型优化：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># 网络结构优化（去除冗余层）</span>
</span></span><span class="line"><span class="cl">./ncnnoptimize yolov8n.param yolov8n.bin yolov8n-opt.param yolov8n-opt.bin <span class="m">65536</span>
</span></span></code></pre></div><p>最后那个参数 <code>65536</code> 是 FP16 存储的 flag。NCNN 会自动把 FP32 的权重转成 FP16，<strong>模型体积直接减半！</strong></p>
<p>这一步非常重要，优化前后的差异：</p>
<table>
<thead>
<tr>
<th>状态</th>
<th>param 行数</th>
<th>模型大小</th>
</tr>
</thead>
<tbody>
<tr>
<td>转换后</td>
<td>233 行</td>
<td>6.5 MB</td>
</tr>
<tr>
<td>优化后</td>
<td>187 行</td>
<td>3.2 MB</td>
</tr>
</tbody>
</table>
<p>体积减小 50%，加载速度提升，推理速度也会更快。</p>
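<p>体积减半的原因很直观：每个权重从 4 字节（FP32）变成 2 字节（FP16）。可以按参数量粗算一下（以 2.3 节的 YOLOv8n 约 3.2M 参数为例，忽略结构描述等开销）：</p>

```python
# 按参数量粗估不同存储精度下的权重体积（YOLOv8n 约 3.2M 参数）
params = 3.2e6
for name, bytes_per_weight in (("FP32", 4), ("FP16", 2), ("INT8", 1)):
    size_mb = params * bytes_per_weight / 1e6
    print(f"{name}: 约 {size_mb:.1f} MB")
```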
<h3 id="44-yolov8-专用后处理">4.4 YOLOv8 专用后处理</h3>
<p>YOLOv8 是 Anchor-Free 的，所以后处理和之前的 YOLO 系列完全不同。</p>
<p>YOLOv8 的输出是一个 shape 为 <code>[1, 84, 8400]</code> 的张量：</p>
<ul>
<li>前 4 个通道：x_center, y_center, width, height</li>
<li>后 80 个通道：80 个类别的置信度</li>
</ul>
<p>NCNN 的后处理代码核心逻辑：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="c1">// 输出特征图处理
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">feat</span><span class="p">.</span><span class="n">w</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">float</span><span class="o">*</span> <span class="n">ptr</span> <span class="o">=</span> <span class="n">feat</span><span class="p">.</span><span class="n">row</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// 中心点坐标
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="kt">float</span> <span class="n">x</span> <span class="o">=</span> <span class="n">ptr</span><span class="p">[</span><span class="mi">0</span> <span class="o">*</span> <span class="n">feat</span><span class="p">.</span><span class="n">w</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">    <span class="kt">float</span> <span class="n">y</span> <span class="o">=</span> <span class="n">ptr</span><span class="p">[</span><span class="mi">1</span> <span class="o">*</span> <span class="n">feat</span><span class="p">.</span><span class="n">w</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">    <span class="kt">float</span> <span class="n">w</span> <span class="o">=</span> <span class="n">ptr</span><span class="p">[</span><span class="mi">2</span> <span class="o">*</span> <span class="n">feat</span><span class="p">.</span><span class="n">w</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">    <span class="kt">float</span> <span class="n">h</span> <span class="o">=</span> <span class="n">ptr</span><span class="p">[</span><span class="mi">3</span> <span class="o">*</span> <span class="n">feat</span><span class="p">.</span><span class="n">w</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// 找到最大置信度类别
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="kt">float</span> <span class="n">max_score</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">max_class</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="mi">80</span><span class="p">;</span> <span class="n">c</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kt">float</span> <span class="n">score</span> <span class="o">=</span> <span class="n">ptr</span><span class="p">[(</span><span class="mi">4</span> <span class="o">+</span> <span class="n">c</span><span class="p">)</span> <span class="o">*</span> <span class="n">feat</span><span class="p">.</span><span class="n">w</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="n">score</span> <span class="o">&gt;</span> <span class="n">max_score</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="n">max_score</span> <span class="o">=</span> <span class="n">score</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="n">max_class</span> <span class="o">=</span> <span class="n">c</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">max_score</span> <span class="o">&gt;</span> <span class="n">conf_thresh</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="c1">// 转换到原图坐标
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="kt">float</span> <span class="n">x1</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">w</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">scale</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kt">float</span> <span class="n">y1</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">h</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">scale</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kt">float</span> <span class="n">x2</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="n">w</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">scale</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kt">float</span> <span class="n">y2</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">h</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">scale</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="n">objects</span><span class="p">.</span><span class="n">push_back</span><span class="p">({</span><span class="n">x1</span><span class="p">,</span> <span class="n">y1</span><span class="p">,</span> <span class="n">x2</span><span class="p">,</span> <span class="n">y2</span><span class="p">,</span> <span class="n">max_class</span><span class="p">,</span> <span class="n">max_score</span><span class="p">});</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>很多人 NCNN 部署完发现检测框全错了，就是后处理写得不对。这个坑我踩了三天才爬出来。</p>
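<p>排查这类问题时，可以先用 NumPy 在 PC 上复现同样的解码逻辑，和 NCNN 的输出逐项比对（以下为示意实现，函数名为笔者自拟）：</p>

```python
import numpy as np

def decode_yolov8(feat, conf_thresh=0.25, scale=1.0):
    """feat: [84, N]，前 4 行是 cx, cy, w, h，后 80 行是各类别置信度"""
    boxes = feat[:4]                      # (4, N)
    scores = feat[4:]                     # (80, N)
    cls_id = scores.argmax(axis=0)        # 每个候选框得分最高的类别
    cls_score = scores.max(axis=0)
    keep = cls_score > conf_thresh        # 置信度过滤
    cx, cy, w, h = boxes[:, keep]
    xyxy = np.stack([(cx - w / 2) * scale, (cy - h / 2) * scale,
                     (cx + w / 2) * scale, (cy + h / 2) * scale], axis=1)
    return xyxy, cls_id[keep], cls_score[keep]
```

<p>把同一张图的 ONNX 推理输出喂给这个函数，再和设备端的检测框对比，就能快速定位是推理错了还是后处理写错了。</p>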
<h2 id="五int8-量化边缘部署的核武器">五、INT8 量化：边缘部署的核武器</h2>
<p>FP16 优化只是入门，<strong>INT8 量化才是边缘部署真正的杀手锏</strong>。</p>
<h3 id="51-量化的基本原理">5.1 量化的基本原理</h3>
<p>神经网络的权重和激活值，实际上分布在一个很小的范围内。把 32 位浮点数映射到 8 位整数空间，精度损失很小，但计算量和内存占用直接减 75%。</p>
<p>量化的核心公式很简单：</p>
<pre tabindex="0"><code>int8_value = round(real_value / scale) + zero_point
</code></pre><p>反量化就是反过来：</p>
<pre tabindex="0"><code>real_value = (int8_value - zero_point) * scale
</code></pre><ul>
<li><strong>scale</strong>：缩放因子</li>
<li><strong>zero_point</strong>：零点偏移（对应浮点数 0 的 int8 值）</li>
</ul>
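<p>用几行 Python 可以直观感受这组公式的往返误差（scale 取假设值，按激活范围约 [-1, 1] 估算）：</p>

```python
import numpy as np

def quantize(x, scale, zero_point=0):
    """FP32 -> INT8，超出范围的值被饱和截断"""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point=0):
    """INT8 -> FP32 反量化"""
    return (q.astype(np.float32) - zero_point) * scale

scale = 1.0 / 127                     # 假设激活值范围约为 [-1, 1]
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
x_hat = dequantize(quantize(x, scale), scale)
print(np.abs(x - x_hat).max())        # 往返误差不超过半个 scale
```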
<h3 id="52-两种量化方式">5.2 两种量化方式</h3>
<table>
<thead>
<tr>
<th>量化方式</th>
<th>原理</th>
<th>精度</th>
<th>难度</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>PTQ 训练后量化</strong></td>
<td>用校准数据统计分布，离线计算 scale</td>
<td>中等</td>
<td>简单</td>
</tr>
<tr>
<td><strong>QAT 量化感知训练</strong></td>
<td>训练时就模拟量化误差，反向传播更新</td>
<td>高</td>
<td>复杂</td>
</tr>
</tbody>
</table>
<p>对于大多数场景，PTQ 就足够了，而且完全不需要重新训练。</p>
<h3 id="53-ncnn-int8-量化流程">5.3 NCNN int8 量化流程</h3>
<p>首先准备校准数据集：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># 准备 100-500 张和业务场景相似的图片，放在 images/ 目录下</span>
</span></span><span class="line"><span class="cl">ls images/ <span class="p">|</span> head -10
</span></span></code></pre></div><p>然后创建校准表：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">./ncnn2table yolov8n-opt.param yolov8n-opt.bin yolov8n.table images/ mean norm shape
</span></span></code></pre></div><p>参数说明：</p>
<ul>
<li><code>mean</code>：均值，通常是 <code>0,0,0</code> 或者 <code>103.53,116.28,123.675</code></li>
<li><code>norm</code>：归一化系数，YOLOv8 是 <code>0.003921568627451</code> (1/255)</li>
<li><code>shape</code>：输入尺寸 <code>640,640,3</code></li>
</ul>
<p>完整命令示例：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">./ncnn2table yolov8n-opt.param yolov8n-opt.bin yolov8n.table <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    images/ <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    0.0,0.0,0.0 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    0.003921568627451,0.003921568627451,0.003921568627451 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>    640,640,3
</span></span></code></pre></div><p>这个过程会遍历所有校准图片，统计每一层的激活值分布，用 KL 散度算法计算最优的 scale 和 zero_point。</p>
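<p>ncnn2table 内部的 KL 校准思路可以用一个极简的 Python 示意来理解：在一系列截断阈值中，选使“截断后分布”与“量化再展开的分布”KL 散度最小的那个，再除以 127 得到 scale。下面只是原理示意，并非 ncnn2table 的实际实现：</p>

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """两个直方图归一化后的 KL 散度"""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    mask = p > 0
    return float(np.sum(p[mask] * np.log((p[mask] + eps) / (q[mask] + eps))))

def search_scale(activations, num_bins=2048, num_levels=128):
    """在一系列截断阈值中选 KL 散度最小的，返回对应的 scale"""
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    best_t, best_kl = edges[-1], float("inf")
    for i in range(num_levels, num_bins + 1, 16):
        t = edges[i]
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()           # 截断：阈值外的值并入最后一个 bin
        # 先压缩到 128 级再展开回 i 个 bin，模拟 int8 的表示能力
        coarse = np.interp(np.linspace(0, i - 1, num_levels), np.arange(i), p)
        q = np.interp(np.arange(i), np.linspace(0, i - 1, num_levels), coarse)
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_t = kl, t
    return best_t / 127.0                 # scale = 最优截断阈值 / 127

np.random.seed(0)
acts = np.random.randn(100000).astype(np.float32)
print("scale ≈", search_scale(acts))
```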
<h3 id="54-校准数据的选择">5.4 校准数据的选择</h3>
<p>这是量化成功与否的关键！很多人量化后精度崩了，问题就出在校准数据上。</p>
<p>校准数据集的三大原则：</p>
<ol>
<li><strong>代表性</strong>：必须和你的业务场景一致。检测行人就用人的图片，不要用 ImageNet 通用数据集。</li>
<li><strong>多样性</strong>：覆盖不同角度、光照、距离、背景。</li>
<li><strong>适量性</strong>：100-500 张足够，太多了浪费时间，太少了统计不准。</li>
</ol>
<p>我的经验是：从训练集中随机抽 200 张，就是最好的校准数据集。</p>
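<p>抽样脚本可以写成下面这样（train_dir、calib_dir 均为假设路径，固定随机种子便于复现）：</p>

```python
import random
import shutil
from pathlib import Path

def sample_calibration_set(train_dir, calib_dir, n=200, seed=42):
    """从训练集中随机抽 n 张图片复制到校准目录，返回实际抽取数量"""
    images = sorted(Path(train_dir).glob("*.jpg"))
    random.seed(seed)                     # 固定种子，结果可复现
    picked = random.sample(images, min(n, len(images)))
    Path(calib_dir).mkdir(parents=True, exist_ok=True)
    for p in picked:
        shutil.copy(p, Path(calib_dir) / p.name)
    return len(picked)
```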
<h3 id="55-生成-int8-模型">5.5 生成 int8 模型</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">./ncnn2int8 yolov8n-opt.param yolov8n-opt.bin yolov8n-int8.param yolov8n-int8.bin yolov8n.table
</span></span></code></pre></div><p>现在对比一下量化前后的变化：</p>
<table>
<thead>
<tr>
<th>模型</th>
<th>内存占用</th>
<th>推理速度（树莓派 5）</th>
<th>mAP 下降</th>
</tr>
</thead>
<tbody>
<tr>
<td>FP32</td>
<td>256 MB</td>
<td>800 ms</td>
<td>0%</td>
</tr>
<tr>
<td>FP16</td>
<td>128 MB</td>
<td>400 ms</td>
<td>&lt; 0.5%</td>
</tr>
<tr>
<td>INT8</td>
<td>64 MB</td>
<td>120 ms</td>
<td>~ 1-2%</td>
</tr>
</tbody>
</table>
<p>内存减少 75%，速度提升 6.7 倍，精度只下降 1-2%。这就是 INT8 量化的威力！</p>
<h3 id="56-常见量化失败原因">5.6 常见量化失败原因</h3>
<p>如果量化后检测框完全不对，排查顺序：</p>
<ol>
<li><strong>均值和归一化系数错了</strong> → 检查 YOLOv8 的预处理是不是 <code>img / 255.0</code></li>
<li><strong>校准数据集不对</strong> → 换和业务场景一致的图片</li>
<li><strong>某层量化误差太大</strong> → 把这一层设为 FP16</li>
<li><strong>输入顺序错了</strong> → RGB vs BGR</li>
</ol>
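<p>针对第 1 条和第 4 条，可以先在 Python 里核对预处理逻辑（下面假设输入是 OpenCV 读出的 BGR uint8 图）：</p>

```python
import numpy as np

def preprocess(img_bgr):
    """BGR uint8 -> RGB float32 CHW，数值归一化到 [0, 1]"""
    x = img_bgr[..., ::-1].astype(np.float32) / 255.0   # BGR -> RGB，再除以 255
    return np.ascontiguousarray(x.transpose(2, 0, 1))   # HWC -> CHW

img = np.zeros((4, 4, 3), dtype=np.uint8)
img[..., 0] = 255                  # 在 BGR 图里填满蓝色通道
x = preprocess(img)
print(x.shape, x[2].max())         # CHW 下索引 2 应是 B 通道，且值为 1.0
```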
<h2 id="六neon-指令级优化">六、NEON 指令级优化</h2>
<p>INT8 量化之后，我们还能继续优化——ARM NEON 指令集。</p>
<h3 id="61-什么是-neon">6.1 什么是 NEON？</h3>
<p>NEON 是 ARM 架构的 SIMD（单指令多数据）扩展指令集。一个 NEON 指令可以同时处理 128 位数据，也就是：</p>
<ul>
<li>16 个 int8 同时运算</li>
<li>8 个 int16 同时运算</li>
<li>4 个 float32 同时运算</li>
</ul>
<p>理论上可以获得 4~16 倍的加速（取决于数据类型）！</p>
<h3 id="62-ncnn-的-neon-优化">6.2 NCNN 的 NEON 优化</h3>
<p>NCNN 已经为所有核心算子做了 NEON 优化：</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="c1">// Conv3x3_s1_neon
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">void</span> <span class="nf">conv3x3s1_neon</span><span class="p">(</span><span class="k">const</span> <span class="n">Mat</span><span class="o">&amp;</span> <span class="n">bottom_blob</span><span class="p">,</span> <span class="n">Mat</span><span class="o">&amp;</span> <span class="n">top_blob</span><span class="p">,</span> <span class="p">...)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="c1">// 16 通道同时计算
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="kt">int</span> <span class="n">nn</span> <span class="o">=</span> <span class="mi">16</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="c1">// ...
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="c1">// NEON 内联汇编
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="k">asm</span> <span class="k">volatile</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="s">&#34;vld1.8   {d0-d3}, [%[img]]       </span><span class="se">\n</span><span class="s">&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="s">&#34;vmul.s8  q0, q0, %[weight]       </span><span class="se">\n</span><span class="s">&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="s">&#34;vst1.8   {d0-d3}, [%[out]]       </span><span class="se">\n</span><span class="s">&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="o">:</span>
</span></span><span class="line"><span class="cl">        <span class="o">:</span> <span class="p">[</span><span class="n">img</span><span class="p">]</span><span class="s">&#34;r&#34;</span><span class="p">(</span><span class="n">img_ptr</span><span class="p">),</span> <span class="p">[</span><span class="n">weight</span><span class="p">]</span><span class="s">&#34;r&#34;</span><span class="p">(</span><span class="n">w_ptr</span><span class="p">),</span> <span class="p">[</span><span class="n">out</span><span class="p">]</span><span class="s">&#34;r&#34;</span><span class="p">(</span><span class="n">out_ptr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="o">:</span> <span class="s">&#34;memory&#34;</span><span class="p">,</span> <span class="s">&#34;q0&#34;</span><span class="p">,</span> <span class="s">&#34;q1&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>你不需要自己写汇编，但要知道：<strong>编译 NCNN 时必须开启 NEON 选项！</strong></p>
<h3 id="63-编译优化选项">6.3 编译优化选项</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cmake" data-lang="cmake"><span class="line"><span class="cl"><span class="c"># CMakeLists.txt 关键配置
</span></span></span><span class="line"><span class="cl"><span class="c"></span><span class="nb">set</span><span class="p">(</span><span class="s">CMAKE_C_FLAGS</span> <span class="s2">&#34;${CMAKE_C_FLAGS} -march=armv8-a+crc+simd&#34;</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">set</span><span class="p">(</span><span class="s">CMAKE_CXX_FLAGS</span> <span class="s2">&#34;${CMAKE_CXX_FLAGS} -march=armv8-a+crc+simd&#34;</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">set</span><span class="p">(</span><span class="s">CMAKE_C_FLAGS</span> <span class="s2">&#34;${CMAKE_C_FLAGS} -O3 -ffast-math&#34;</span><span class="p">)</span><span class="err">
</span></span></span><span class="line"><span class="cl"><span class="err"></span><span class="nb">set</span><span class="p">(</span><span class="s">CMAKE_CXX_FLAGS</span> <span class="s2">&#34;${CMAKE_CXX_FLAGS} -O3 -ffast-math&#34;</span><span class="p">)</span><span class="err">
</span></span></span></code></pre></div><ul>
<li><code>-march=armv8-a</code>: targets the ARMv8-A architecture</li>
<li><code>+simd</code>: enables NEON (Advanced SIMD)</li>
<li><code>-O3</code>: highest optimization level</li>
<li><code>-ffast-math</code>: faster math at the cost of a little precision</li>
</ul>
<p>With these flags enabled, inference speed typically improves by another 30-50%.</p>
<h2 id="七完整部署代码实现">7. Complete Deployment Code</h2>
<p>Now let's write a YOLOv8 NCNN detection program that you can actually run.</p>
<h3 id="71-核心类定义">7.1 Core Class Definition</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;opencv2/opencv.hpp&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;net.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">YOLOv8Detector</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl"><span class="k">public</span><span class="o">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">YOLOv8Detector</span><span class="p">(</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&amp;</span> <span class="n">param_path</span><span class="p">,</span> 
</span></span><span class="line"><span class="cl">                   <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&amp;</span> <span class="n">bin_path</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                   <span class="kt">bool</span> <span class="n">use_int8</span> <span class="o">=</span> <span class="nb">false</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="o">~</span><span class="n">YOLOv8Detector</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">Object</span><span class="o">&gt;</span> <span class="n">detect</span><span class="p">(</span><span class="k">const</span> <span class="n">cv</span><span class="o">::</span><span class="n">Mat</span><span class="o">&amp;</span> <span class="n">image</span><span class="p">,</span> 
</span></span><span class="line"><span class="cl">                               <span class="kt">float</span> <span class="n">conf_thresh</span> <span class="o">=</span> <span class="mf">0.25</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                               <span class="kt">float</span> <span class="n">nms_thresh</span> <span class="o">=</span> <span class="mf">0.45</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl"><span class="k">private</span><span class="o">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">ncnn</span><span class="o">::</span><span class="n">Net</span> <span class="n">net_</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">input_size_</span> <span class="o">=</span> <span class="mi">640</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">num_classes_</span> <span class="o">=</span> <span class="mi">80</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="kt">void</span> <span class="nf">preprocess</span><span class="p">(</span><span class="k">const</span> <span class="n">cv</span><span class="o">::</span><span class="n">Mat</span><span class="o">&amp;</span> <span class="n">image</span><span class="p">,</span> <span class="n">ncnn</span><span class="o">::</span><span class="n">Mat</span><span class="o">&amp;</span> <span class="n">in</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">void</span> <span class="nf">postprocess</span><span class="p">(</span><span class="k">const</span> <span class="n">ncnn</span><span class="o">::</span><span class="n">Mat</span><span class="o">&amp;</span> <span class="n">out</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">Object</span><span class="o">&gt;&amp;</span> <span class="n">objects</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                     <span class="kt">float</span> <span class="n">scale</span><span class="p">,</span> <span class="kt">float</span> <span class="n">conf_thresh</span><span class="p">,</span> <span class="kt">float</span> <span class="n">nms_thresh</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span></code></pre></div><h3 id="72-构造与初始化">7.2 Construction and Initialization</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="n">YOLOv8Detector</span><span class="o">::</span><span class="n">YOLOv8Detector</span><span class="p">(</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&amp;</span> <span class="n">param_path</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                               <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&amp;</span> <span class="n">bin_path</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                               <span class="kt">bool</span> <span class="n">use_int8</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="c1">// Configure options before loading: ncnn applies them at load time
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">net_</span><span class="p">.</span><span class="n">opt</span><span class="p">.</span><span class="n">num_threads</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// CPU-only NEON path with FP16; INT8 only when requested
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">net_</span><span class="p">.</span><span class="n">opt</span><span class="p">.</span><span class="n">use_vulkan_compute</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">net_</span><span class="p">.</span><span class="n">opt</span><span class="p">.</span><span class="n">use_fp16_packed</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">net_</span><span class="p">.</span><span class="n">opt</span><span class="p">.</span><span class="n">use_fp16_storage</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">net_</span><span class="p">.</span><span class="n">opt</span><span class="p">.</span><span class="n">use_fp16_arithmetic</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">net_</span><span class="p">.</span><span class="n">opt</span><span class="p">.</span><span class="n">use_int8_storage</span> <span class="o">=</span> <span class="n">use_int8</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">net_</span><span class="p">.</span><span class="n">opt</span><span class="p">.</span><span class="n">use_int8_arithmetic</span> <span class="o">=</span> <span class="n">use_int8</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// Load the network definition, then the weights
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">net_</span><span class="p">.</span><span class="n">load_param</span><span class="p">(</span><span class="n">param_path</span><span class="p">.</span><span class="n">c_str</span><span class="p">());</span>
</span></span><span class="line"><span class="cl">    <span class="n">net_</span><span class="p">.</span><span class="n">load_model</span><span class="p">(</span><span class="n">bin_path</span><span class="p">.</span><span class="n">c_str</span><span class="p">());</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><h3 id="73-预处理函数">7.3 Preprocessing</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="kt">void</span> <span class="n">YOLOv8Detector</span><span class="o">::</span><span class="n">preprocess</span><span class="p">(</span><span class="k">const</span> <span class="n">cv</span><span class="o">::</span><span class="n">Mat</span><span class="o">&amp;</span> <span class="n">image</span><span class="p">,</span> <span class="n">ncnn</span><span class="o">::</span><span class="n">Mat</span><span class="o">&amp;</span> <span class="n">in</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="c1">// Scale proportionally, keeping the aspect ratio
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="kt">float</span> <span class="n">scale</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">min</span><span class="p">(</span><span class="n">input_size_</span> <span class="o">*</span> <span class="mf">1.0f</span> <span class="o">/</span> <span class="n">image</span><span class="p">.</span><span class="n">cols</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                           <span class="n">input_size_</span> <span class="o">*</span> <span class="mf">1.0f</span> <span class="o">/</span> <span class="n">image</span><span class="p">.</span><span class="n">rows</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">w</span> <span class="o">=</span> <span class="n">image</span><span class="p">.</span><span class="n">cols</span> <span class="o">*</span> <span class="n">scale</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">h</span> <span class="o">=</span> <span class="n">image</span><span class="p">.</span><span class="n">rows</span> <span class="o">*</span> <span class="n">scale</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// Letterbox: resize to the scaled size
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">cv</span><span class="o">::</span><span class="n">Mat</span> <span class="n">resized</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">cv</span><span class="o">::</span><span class="n">resize</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">resized</span><span class="p">,</span> <span class="n">cv</span><span class="o">::</span><span class="n">Size</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// Pad to 640x640 with gray (114,114,114), anchored at the top-left,
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="c1">// so postprocess only needs to divide by scale to map boxes back
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">cv</span><span class="o">::</span><span class="n">Mat</span> <span class="n">padded</span><span class="p">(</span><span class="n">input_size_</span><span class="p">,</span> <span class="n">input_size_</span><span class="p">,</span> <span class="n">CV_8UC3</span><span class="p">,</span> <span class="n">cv</span><span class="o">::</span><span class="n">Scalar</span><span class="p">(</span><span class="mi">114</span><span class="p">,</span> <span class="mi">114</span><span class="p">,</span> <span class="mi">114</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="n">resized</span><span class="p">.</span><span class="n">copyTo</span><span class="p">(</span><span class="n">padded</span><span class="p">(</span><span class="n">cv</span><span class="o">::</span><span class="n">Rect</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">)));</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// BGR -&gt; RGB conversion
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">in</span> <span class="o">=</span> <span class="n">ncnn</span><span class="o">::</span><span class="n">Mat</span><span class="o">::</span><span class="n">from_pixels</span><span class="p">(</span><span class="n">padded</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">ncnn</span><span class="o">::</span><span class="n">Mat</span><span class="o">::</span><span class="n">PIXEL_BGR2RGB</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                <span class="n">input_size_</span><span class="p">,</span> <span class="n">input_size_</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// YOLOv8 normalization: multiply by 1/255
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="kt">float</span> <span class="n">norm</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="o">/</span><span class="mf">255.0f</span><span class="p">,</span> <span class="mi">1</span><span class="o">/</span><span class="mf">255.0f</span><span class="p">,</span> <span class="mi">1</span><span class="o">/</span><span class="mf">255.0f</span><span class="p">};</span>
</span></span><span class="line"><span class="cl">    <span class="n">in</span><span class="p">.</span><span class="n">substract_mean_normalize</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">norm</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Preprocessing is the most easily overlooked step, yet it has the biggest impact. The letterbox fill value must be <code>(114, 114, 114)</code> to match training; getting it wrong, or mishandling the padding offsets, is a common cause of shifted detection boxes.</p>
<h3 id="74-推理与后处理">7.4 Inference and Postprocessing</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">Object</span><span class="o">&gt;</span> <span class="n">YOLOv8Detector</span><span class="o">::</span><span class="n">detect</span><span class="p">(</span><span class="k">const</span> <span class="n">cv</span><span class="o">::</span><span class="n">Mat</span><span class="o">&amp;</span> <span class="n">image</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                            <span class="kt">float</span> <span class="n">conf_thresh</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                            <span class="kt">float</span> <span class="n">nms_thresh</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="c1">// Compute the scale factor (must match preprocess)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="kt">float</span> <span class="n">scale</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">min</span><span class="p">(</span><span class="n">input_size_</span> <span class="o">*</span> <span class="mf">1.0f</span> <span class="o">/</span> <span class="n">image</span><span class="p">.</span><span class="n">cols</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                           <span class="n">input_size_</span> <span class="o">*</span> <span class="mf">1.0f</span> <span class="o">/</span> <span class="n">image</span><span class="p">.</span><span class="n">rows</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// Preprocess
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">ncnn</span><span class="o">::</span><span class="n">Mat</span> <span class="n">in</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">preprocess</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">in</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// Run inference
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">ncnn</span><span class="o">::</span><span class="n">Extractor</span> <span class="n">ex</span> <span class="o">=</span> <span class="n">net_</span><span class="p">.</span><span class="n">create_extractor</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="n">ex</span><span class="p">.</span><span class="n">input</span><span class="p">(</span><span class="s">&#34;images&#34;</span><span class="p">,</span> <span class="n">in</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="n">ncnn</span><span class="o">::</span><span class="n">Mat</span> <span class="n">out</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">ex</span><span class="p">.</span><span class="n">extract</span><span class="p">(</span><span class="s">&#34;output0&#34;</span><span class="p">,</span> <span class="n">out</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// Postprocess
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">Object</span><span class="o">&gt;</span> <span class="n">objects</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">postprocess</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="n">objects</span><span class="p">,</span> <span class="n">scale</span><span class="p">,</span> <span class="n">conf_thresh</span><span class="p">,</span> <span class="n">nms_thresh</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">objects</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="n">YOLOv8Detector</span><span class="o">::</span><span class="n">postprocess</span><span class="p">(</span><span class="k">const</span> <span class="n">ncnn</span><span class="o">::</span><span class="n">Mat</span><span class="o">&amp;</span> <span class="n">out</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                 <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">Object</span><span class="o">&gt;&amp;</span> <span class="n">objects</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                 <span class="kt">float</span> <span class="n">scale</span><span class="p">,</span> <span class="kt">float</span> <span class="n">conf_thresh</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                 <span class="kt">float</span> <span class="n">nms_thresh</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="c1">// out shape: [1, 84, 8400]; stored in NCNN as w=8400, h=84
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="k">const</span> <span class="kt">int</span> <span class="n">num_points</span> <span class="o">=</span> <span class="n">out</span><span class="p">.</span><span class="n">w</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_points</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="c1">// Find the class with the highest score
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="kt">float</span> <span class="n">max_score</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kt">int</span> <span class="n">max_class</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">num_classes_</span><span class="p">;</span> <span class="n">c</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="kt">float</span> <span class="n">score</span> <span class="o">=</span> <span class="n">out</span><span class="p">.</span><span class="n">row</span><span class="p">(</span><span class="mi">4</span> <span class="o">+</span> <span class="n">c</span><span class="p">)[</span><span class="n">i</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="p">(</span><span class="n">score</span> <span class="o">&gt;</span> <span class="n">max_score</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="n">max_score</span> <span class="o">=</span> <span class="n">score</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">                <span class="n">max_class</span> <span class="o">=</span> <span class="n">c</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="n">max_score</span> <span class="o">&gt;</span> <span class="n">conf_thresh</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="kt">float</span> <span class="n">cx</span> <span class="o">=</span> <span class="n">out</span><span class="p">.</span><span class="n">row</span><span class="p">(</span><span class="mi">0</span><span class="p">)[</span><span class="n">i</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">            <span class="kt">float</span> <span class="n">cy</span> <span class="o">=</span> <span class="n">out</span><span class="p">.</span><span class="n">row</span><span class="p">(</span><span class="mi">1</span><span class="p">)[</span><span class="n">i</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">            <span class="kt">float</span> <span class="n">w</span> <span class="o">=</span> <span class="n">out</span><span class="p">.</span><span class="n">row</span><span class="p">(</span><span class="mi">2</span><span class="p">)[</span><span class="n">i</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">            <span class="kt">float</span> <span class="n">h</span> <span class="o">=</span> <span class="n">out</span><span class="p">.</span><span class="n">row</span><span class="p">(</span><span class="mi">3</span><span class="p">)[</span><span class="n">i</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="c1">// Map back to original-image coordinates; with top-left letterbox
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>            <span class="c1">// padding dividing by scale is enough (center padding would also need the offsets subtracted)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>            <span class="kt">float</span> <span class="n">x1</span> <span class="o">=</span> <span class="p">(</span><span class="n">cx</span> <span class="o">-</span> <span class="n">w</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span> <span class="o">/</span> <span class="n">scale</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="kt">float</span> <span class="n">y1</span> <span class="o">=</span> <span class="p">(</span><span class="n">cy</span> <span class="o">-</span> <span class="n">h</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span> <span class="o">/</span> <span class="n">scale</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="kt">float</span> <span class="n">x2</span> <span class="o">=</span> <span class="p">(</span><span class="n">cx</span> <span class="o">+</span> <span class="n">w</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span> <span class="o">/</span> <span class="n">scale</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="kt">float</span> <span class="n">y2</span> <span class="o">=</span> <span class="p">(</span><span class="n">cy</span> <span class="o">+</span> <span class="n">h</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span> <span class="o">/</span> <span class="n">scale</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            
</span></span><span class="line"><span class="cl">            <span class="n">objects</span><span class="p">.</span><span class="n">push_back</span><span class="p">({</span><span class="n">x1</span><span class="p">,</span> <span class="n">y1</span><span class="p">,</span> <span class="n">x2</span><span class="p">,</span> <span class="n">y2</span><span class="p">,</span> <span class="n">max_class</span><span class="p">,</span> <span class="n">max_score</span><span class="p">});</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// NMS
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">qsort_descent_inplace</span><span class="p">(</span><span class="n">objects</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span> <span class="n">picked</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">nms_sorted_bboxes</span><span class="p">(</span><span class="n">objects</span><span class="p">,</span> <span class="n">picked</span><span class="p">,</span> <span class="n">nms_thresh</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1">// Keep only the boxes that survived NMS
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>    <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">Object</span><span class="o">&gt;</span> <span class="n">kept</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">idx</span> <span class="o">:</span> <span class="n">picked</span><span class="p">)</span> <span class="n">kept</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">objects</span><span class="p">[</span><span class="n">idx</span><span class="p">]);</span>
</span></span><span class="line"><span class="cl">    <span class="n">objects</span><span class="p">.</span><span class="n">swap</span><span class="p">(</span><span class="n">kept</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><h2 id="八性能对比与优化效果">8. Performance Comparison and Optimization Results</h2>
<p>Let the numbers speak. Here are the measured results on a Raspberry Pi 5:</p>
<table>
<thead>
<tr>
<th>Optimization Stage</th>
<th>Inference Time</th>
<th>FPS</th>
<th>Memory Usage</th>
</tr>
</thead>
<tbody>
<tr>
<td>Original PyTorch</td>
<td>800 ms</td>
<td>1.25</td>
<td>256 MB</td>
</tr>
<tr>
<td>ONNX Runtime FP32</td>
<td>600 ms</td>
<td>1.67</td>
<td>200 MB</td>
</tr>
<tr>
<td>NCNN FP32</td>
<td>350 ms</td>
<td>2.86</td>
<td>128 MB</td>
</tr>
<tr>
<td>NCNN FP16</td>
<td>180 ms</td>
<td>5.56</td>
<td>64 MB</td>
</tr>
<tr>
<td>NCNN INT8</td>
<td>80 ms</td>
<td>12.5</td>
<td>32 MB</td>
</tr>
<tr>
<td>NCNN INT8 + multithreading</td>
<td>40 ms</td>
<td>25</td>
<td>32 MB</td>
</tr>
</tbody>
</table>
<p>From 1.25 FPS to 25 FPS, a full 20x improvement, while memory usage dropped from 256 MB to 32 MB.</p>
<h3 id="81-各优化手段的贡献度">8.1 Contribution of Each Optimization</h3>
<table>
<thead>
<tr>
<th>Optimization</th>
<th>Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>Inference framework swap (PyTorch → NCNN)</td>
<td>2.3x</td>
</tr>
<tr>
<td>FP16 storage and compute</td>
<td>2x</td>
</tr>
<tr>
<td>INT8 quantization</td>
<td>2.25x</td>
</tr>
<tr>
<td>4-thread parallelism</td>
<td>2x</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>20x</strong></td>
</tr>
</tbody>
</table>
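<p>As a sanity check, these per-stage speedups are multiplicative, and their product reproduces the end-to-end improvement from the first table (a quick calculation, nothing framework-specific):</p>

```python
# Per-stage speedups from the contribution table above
stage_speedups = {
    "PyTorch -> NCNN": 2.3,
    "FP16": 2.0,
    "INT8": 2.25,
    "4 threads": 2.0,
}

combined = 1.0
for ratio in stage_speedups.values():
    combined *= ratio  # independent speedups compound multiplicatively

print(f"combined speedup: {combined:.1f}x")          # ~20.7x, i.e. the "20x" above
print(f"expected latency: {800 / combined:.0f} ms")  # ~39 ms vs. the measured 40 ms
```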
<p>As the table shows, no single technique is a silver bullet; each optimization matters, and only in combination do they deliver the full result.</p>
<h2 id="九常见问题与解决方案">9. Common Problems and Solutions</h2>
<p>In practice, 90% of deployment problems cluster around a few areas:</p>
<h3 id="91-检测框完全不对">9.1 Detection boxes completely wrong</h3>
<p><strong>Symptom</strong>: the detected boxes either pile up in a corner or drift around at random.</p>
<p><strong>Debugging order</strong>:</p>
<ol>
<li>Check whether the BGR → RGB channel order was flipped</li>
<li>Check that the letterbox padding and coordinate transforms are correct</li>
<li>Check that the normalization factor is 1/255</li>
<li>Check that the post-processing decodes the anchor-free output format</li>
</ol>
<h3 id="92-检测框位置偏移">9.2 Detection boxes offset</h3>
<p><strong>Symptom</strong>: boxes are roughly in the right place but always slightly off.</p>
<p><strong>Cause</strong>: the letterbox padding was not subtracted when mapping coordinates back to the original image.</p>
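<p>The fix is mechanical: record the scale and padding that letterbox applied, then subtract the padding and divide by the scale on the way back. A framework-free sketch (function and variable names are mine, not from any particular library):</p>

```python
def letterbox_params(src_w, src_h, dst=640):
    """Scale and padding used when fitting (src_w, src_h) into a
    dst x dst square while preserving aspect ratio."""
    scale = min(dst / src_w, dst / src_h)
    pad_x = (dst - src_w * scale) / 2  # left/right padding
    pad_y = (dst - src_h * scale) / 2  # top/bottom padding
    return scale, pad_x, pad_y

def unmap_box(box, scale, pad_x, pad_y):
    """Map (x1, y1, x2, y2) from letterboxed input coordinates back
    to the original image: subtract padding first, then unscale."""
    x1, y1, x2, y2 = box
    return ((x1 - pad_x) / scale, (y1 - pad_y) / scale,
            (x2 - pad_x) / scale, (y2 - pad_y) / scale)

# A 1920x1080 frame in a 640x640 input: scale = 1/3, pad_y = 140
scale, pad_x, pad_y = letterbox_params(1920, 1080)
print(unmap_box((320, 320, 480, 480), scale, pad_x, pad_y))
```

Forgetting the <code>- pad_y</code> term here produces exactly the constant vertical offset described above.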
<h3 id="93-量化后精度大幅下降">9.3 Severe accuracy drop after quantization</h3>
<p><strong>Symptom</strong>: FP16 works fine, but INT8 detects almost nothing.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li>Switch the calibration dataset to images that match the deployment scenario</li>
<li>Increase the number of calibration images to 500</li>
<li>Double-check the mean and normalization parameters</li>
<li>Force the last few detection-head layers to stay in FP16</li>
</ol>
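<p>For the calibration-related items (1-3), the knobs live in ncnn's quantization tools. A hedged sketch of the two commands (file names are placeholders; the mean/norm values assume the zero-mean, 1/255 preprocessing used in this article and must match your inference code exactly):</p>

```shell
# Build a calibration table with the KL-divergence method
./ncnn2table yolov8n.param yolov8n.bin imagelist.txt yolov8n.table \
    mean=[0,0,0] norm=[0.003922,0.003922,0.003922] \
    shape=[640,640,3] pixel=RGB method=kl

# Bake the table into an INT8 model
./ncnn2int8 yolov8n.param yolov8n.bin yolov8n-int8.param yolov8n-int8.bin yolov8n.table
```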
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="c1">// In the param file, add flag=1 to the target layer
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="n">Convolution</span>      <span class="n">conv_out</span>  <span class="mi">1</span>  <span class="mi">1</span>  <span class="p">...</span>  <span class="mi">256</span> <span class="mi">84</span> <span class="mi">1</span> <span class="mi">1</span> <span class="mi">1</span> <span class="mi">1</span> <span class="mi">0</span><span class="o">=</span><span class="mi">1</span>
</span></span></code></pre></div><h3 id="94-内存占用过高">9.4 Excessive memory usage</h3>
<p><strong>Symptom</strong>: inference runs out of memory or the system freezes.</p>
<p><strong>Remedies</strong>:</p>
<ol>
<li>Use the INT8 model</li>
<li>Enable NCNN&rsquo;s <code>lightmode</code> option</li>
<li>Reduce the input size to 416</li>
<li>Use v8n instead of v8s</li>
</ol>
<h2 id="十进阶优化方向">10. Further Optimization Directions</h2>
<p>Once the basic deployment works, there is more to dig into:</p>
<h3 id="101-模型蒸馏">10.1 Model distillation</h3>
<p>Training the small model (v8n) under a large teacher (v8l) can add 3-5 mAP points without increasing compute. Note that the <code>distill</code> and <code>distill_ratio</code> arguments below come from distillation-enabled forks; the stock Ultralytics trainer does not accept them.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">ultralytics</span> <span class="kn">import</span> <span class="n">YOLO</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># teacher model</span>
</span></span><span class="line"><span class="cl"><span class="n">teacher</span> <span class="o">=</span> <span class="n">YOLO</span><span class="p">(</span><span class="s1">&#39;yolov8l.pt&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># student model</span>
</span></span><span class="line"><span class="cl"><span class="n">student</span> <span class="o">=</span> <span class="n">YOLO</span><span class="p">(</span><span class="s1">&#39;yolov8n.yaml&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># distillation training</span>
</span></span><span class="line"><span class="cl"><span class="n">student</span><span class="o">.</span><span class="n">train</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">data</span><span class="o">=</span><span class="s1">&#39;coco.yaml&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">epochs</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">distill</span><span class="o">=</span><span class="s1">&#39;yolov8l.pt&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">distill_ratio</span><span class="o">=</span><span class="mf">0.5</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><h3 id="102-模型剪枝">10.2 Model pruning</h3>
<p>Remove redundant channels and layers. Structured pruning can cut another 30-50% of the compute while keeping the accuracy loss under 1%.</p>
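<p>Structured pruning typically ranks whole output channels by an importance score, such as the L1 norm of each filter&rsquo;s weights, drops the weakest ones, and then fine-tunes. A framework-free sketch of the ranking step (the weights here are made up; real pipelines run dedicated tooling on the trained model):</p>

```python
def l1_channel_ranking(filters):
    """Rank conv output channels by the L1 norm of their weights.
    `filters` is a list of flat weight lists, one per output channel.
    Returns channel indices sorted from least to most important."""
    scores = [sum(abs(w) for w in f) for f in filters]
    return sorted(range(len(filters)), key=lambda i: scores[i])

def prune_channels(filters, ratio=0.5):
    """Drop the `ratio` fraction of channels with the smallest L1 norm."""
    order = l1_channel_ranking(filters)
    dropped = set(order[: int(len(filters) * ratio)])
    return [f for i, f in enumerate(filters) if i not in dropped]

filters = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.02]]
kept = prune_channels(filters, ratio=0.5)
print(len(kept))  # 2: the two near-zero channels are removed
```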
<h3 id="103-流水线推理">10.3 Pipelined inference</h3>
<p>Pipeline the preprocessing, inference, and post-processing stages so that multiple threads overlap their work:</p>
<pre tabindex="0"><code>Thread 1: preprocess frame N
Thread 2: run inference on frame N-1
Thread 3: postprocess frame N-2
</code></pre><p>Overlapping the stages this way can cut the effective per-frame time by roughly another 30%.</p>
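<p>A minimal sketch of that three-stage pipeline using stdlib threads and bounded queues (the stage bodies are dummies standing in for the real letterbox / forward / NMS calls):</p>

```python
import queue
import threading

def stage(fn, q_in, q_out):
    """Pull items from q_in, apply fn, push to q_out; None shuts the stage down."""
    while (item := q_in.get()) is not None:
        q_out.put(fn(item))
    q_out.put(None)  # propagate shutdown downstream

# Dummy stage bodies; in the real pipeline: preprocess, net.forward, postprocess
preprocess  = lambda frame: frame * 2
infer       = lambda blob: blob + 1
postprocess = lambda out: f"result {out}"

q0, q1, q2, q3 = (queue.Queue(maxsize=4) for _ in range(4))
threads = [
    threading.Thread(target=stage, args=(preprocess, q0, q1)),
    threading.Thread(target=stage, args=(infer, q1, q2)),
    threading.Thread(target=stage, args=(postprocess, q2, q3)),
]
for t in threads:
    t.start()

for frame in range(3):  # feed three frames, then signal shutdown
    q0.put(frame)
q0.put(None)

results = []
while (r := q3.get()) is not None:
    results.append(r)
for t in threads:
    t.join()
print(results)  # ['result 1', 'result 3', 'result 5']
```

The bounded queues provide backpressure: if inference is the slow stage, preprocessing blocks instead of piling up frames in memory.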
<h3 id="104-npu-硬件加速">10.4 NPU hardware acceleration</h3>
<p>On chips with an NPU, such as the Rockchip RK3588, converting to an RKNN model lets the dedicated accelerator take over and can double or triple the frame rate again.</p>
<h2 id="总结">Conclusion</h2>
<p>Deploying YOLOv8 at the edge is a systems engineering effort, not something a single export command takes care of. Recapping the whole flow:</p>
<ol>
<li><strong>Model choice</strong>: v8n first, v8s second; forget anything bigger</li>
<li><strong>ONNX export</strong>: opset 12 + onnxsim, avoid dynamic shapes</li>
<li><strong>NCNN conversion</strong>: watch out for Slice operator compatibility; tune the params by hand</li>
<li><strong>INT8 quantization</strong>: the calibration dataset is the key; KL divergence is the standard algorithm</li>
<li><strong>Build flags</strong>: NEON, O3, and multithreading, every one of them</li>
<li><strong>Code optimization</strong>: zero-copy preprocessing, vectorized post-processing</li>
</ol>
<p>From 800 ms to 40 ms, a 20x performance gain, and every step has a clear methodology behind it. That is the appeal of edge AI: you are not writing a paper, you are bargaining with physical limits. Every millisecond shaved off is hardware potential dug out by hand.</p>
<p>One line to take away: <strong>in edge computing, efficiency is dignity, and milliseconds are the lifeline.</strong></p>
<p>Once you have YOLOv8 running at 25 FPS, you realize how much an edge device can actually do. That is what embedded AI is all about.</p>
<p>(End of article.)</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
