<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>DDR on Tech Snippets - 嵌入式技术笔记</title>
    <link>https://tech-snippets.xyz/tags/ddr/</link>
    <description>Recent content in DDR on Tech Snippets - 嵌入式技术笔记</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Mon, 01 Jun 2026 19:00:00 +0800</lastBuildDate>
    <atom:link href="https://tech-snippets.xyz/tags/ddr/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>DDR Memory Bandwidth Tuning in Practice: An SoC Performance Optimization Guide from the AXI Bus to Cache Misses</title>
      <link>https://tech-snippets.xyz/posts/ddr-memory-controller-bandwidth-optimization-guide/</link>
      <pubDate>Mon, 01 Jun 2026 19:00:00 +0800</pubDate>
      <guid>https://tech-snippets.xyz/posts/ddr-memory-controller-bandwidth-optimization-guide/</guid>
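      <description>Preface When you work on embedded Linux or edge AI projects, many performance problems eventually come back to a plain but easily underestimated fact: compute is not throughput. However fast the CPU, NPU, and GPU run, if data cannot be fed to them fast enough, the memory system caps whole-system performance.
I first really understood how much DDR bandwidth matters while doing 4-camera video analytics on a multi-core ARM SoC. The algorithm engineers saw NPU utilization at only about 40% and assumed the model could be made larger; the system engineers saw low CPU usage and assumed the bottleneck was not in software. Only when we loaded the ISP, RGA, NPU, and VPU at the same time and then read the DDR controller counters did we find that memory reads and writes were already close to the platform's sustainable bandwidth limit. At that moment, “plenty of unused compute” turned out to mean “everyone is waiting on memory”.
This article tries to explain that problem more thoroughly: DDR bandwidth is not an isolated parameter. It runs through the CPU caches, the AXI/NoC interconnect, DMA bursts, memory controller scheduling, DRAM bank conflicts, refresh overhead, and Linux scheduling policy. In many projects people run a memcpy or stream test, see a decent number, and conclude that memory is fine; but real workloads are rarely large contiguous copies. They are multiple masters accessing memory at once, mixed reads and writes, fluctuating cache hit rates, and real-time and background tasks competing for the bus.
Starting from the SoC's point of view, this article breaks down one memory access path and gives a practical method for diagnosis and optimization. The sample code is mainly Linux user space, with notes that also apply to bare-metal/RTOS. The goal is not to memorize every DDR timing parameter but to build a judgment framework that is useful in engineering: when to look at cache misses, when to look at AXI outstanding transactions, when to suspect the DDR controller's page policy, and when to start from data layout and DMA bursts.
1. Be Clear About What “Bandwidth” Means The theoretical bandwidth calculation commonly found in DDR vendor datasheets is simple:
Theoretical bandwidth (MB/s) = data bus width (bits) / 8 × data rate (MT/s) For example, with 32-bit LPDDR4X at a data rate of 4266 MT/s, the theoretical peak is about:</description>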
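      <content:encoded><![CDATA[<h2 id="前言">Preface</h2>
<p>When you work on embedded Linux or edge AI projects, many performance problems eventually come back to a plain but easily underestimated fact: compute is not throughput. However fast the CPU, NPU, and GPU run, if data cannot be fed to them fast enough, the memory system caps whole-system performance.</p>
<p>I first really understood how much DDR bandwidth matters while doing 4-camera video analytics on a multi-core ARM SoC. The algorithm engineers saw NPU utilization at only about 40% and assumed the model could be made larger; the system engineers saw low CPU usage and assumed the bottleneck was not in software. Only when we loaded the ISP, RGA, NPU, and VPU at the same time and then read the DDR controller counters did we find that memory reads and writes were already close to the platform's sustainable bandwidth limit. At that moment, “plenty of unused compute” turned out to mean “everyone is waiting on memory”.</p>
<p>This article tries to explain that problem more thoroughly: DDR bandwidth is not an isolated parameter. It runs through the CPU caches, the AXI/NoC interconnect, DMA bursts, memory controller scheduling, DRAM bank conflicts, refresh overhead, and Linux scheduling policy. In many projects people run a <code>memcpy</code> or <code>stream</code> test, see a decent number, and conclude that memory is fine; but real workloads are rarely large contiguous copies. They are multiple masters accessing memory at once, mixed reads and writes, fluctuating cache hit rates, and real-time and background tasks competing for the bus.</p>
<p>Starting from the SoC's point of view, this article breaks down one memory access path and gives a practical method for diagnosis and optimization. The sample code is mainly Linux user space, with notes that also apply to bare-metal/RTOS. The goal is not to memorize every DDR timing parameter but to build a judgment framework that is useful in engineering: when to look at cache misses, when to look at AXI outstanding transactions, when to suspect the DDR controller's page policy, and when to start from data layout and DMA bursts.</p>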
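<p><img alt="SoC DDR bandwidth path and tuning observation points" loading="lazy" src="/images/ddr-memory-bandwidth-architecture.svg"></p>
<h2 id="一先把带宽这件事说清楚">1. Be Clear About What “Bandwidth” Means</h2>
<p>The theoretical bandwidth calculation commonly found in DDR vendor datasheets is simple:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Theoretical bandwidth (MB/s) = data bus width (bits) / 8 × data rate (MT/s)
</span></span></code></pre></div><p>For example, with 32-bit LPDDR4X at a data rate of 4266 MT/s, the theoretical peak is about:</p>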
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">32 / 8 × 4266 = 17064 MB/s ≈ 17 GB/s
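</span></span></code></pre></div><p>This number looks great, but the most common engineering pitfall is treating the theoretical peak as the bandwidth available to the workload. A real system has at least these kinds of loss:</p>
<ol>
<li><strong>Protocol and control overhead</strong>: DRAM is not an infinitely fast SRAM; row activation, precharge, refresh, and read/write turnaround all consume cycles.</li>
<li><strong>Access pattern loss</strong>: sequential and random access differ enormously, and hitting an open row is far more efficient than frequently switching rows.</li>
<li><strong>Multi-master contention</strong>: the CPU, GPU, NPU, ISP, VPU, display controller, PCIe, and USB may all access DDR through AXI/NoC.</li>
<li><strong>Cache line granularity amplification</strong>: when the CPU reads a 4-byte integer that is not in the cache, it usually pulls in an entire cache line.</li>
<li><strong>Extra copies in the software stack</strong>: if video frames, network packets, and AI tensors are copied back and forth between modules, bandwidth is quietly eaten up.</li>
</ol>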
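<p>The peak formula above and the cache line amplification in item 4 can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are illustrative, and the 64-byte cache line is an assumption that should be replaced with the line size of your core.</p>

```python
def peak_bandwidth_mb_s(bus_width_bits: int, data_rate_mt_s: float) -> float:
    """Theoretical peak in MB/s: bytes per transfer times mega-transfers per second."""
    return bus_width_bits / 8 * data_rate_mt_s


def line_fill_amplification(useful_bytes: int, cache_line_bytes: int = 64) -> float:
    """DDR bytes fetched per useful byte when every access misses on a new line.

    The 64-byte default is an assumption; check the cache line size of your core.
    """
    return cache_line_bytes / useful_bytes


peak = peak_bandwidth_mb_s(32, 4266)
print(f"peak: {peak:.0f} MB/s")
print(f"amplification for a 4-byte read: {line_fill_amplification(4):.0f}x")
```

<p>Under that assumption, a workload that touches one 4-byte value per cache line moves 16 times more data over DDR than it actually consumes.</p>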
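<p>So I prefer to split bandwidth into three tiers:</p>
<ul>
<li><strong>Theoretical peak bandwidth</strong>: determined by DDR type, frequency, and bus width; used to judge the upper bound.</li>
<li><strong>Platform sustainable bandwidth</strong>: the number a benchmark sustains over a long run at stable temperature, voltage, and frequency.</li>
<li><strong>Effective workload bandwidth</strong>: the data throughput that actually turns into useful computation in the real workload.</li>
</ul>
<p>The third is the one that matters most during optimization. A platform that reaches 12 GB/s in STREAM does not mean your vision pipeline can use 12 GB/s. If the algorithm's access pattern is poor, or multimedia DMA and the CPU compete for the bus at the same time, effective workload bandwidth may be only a few GB/s, or even lower.</p>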
<h2 id="二一次内存访问到底经过了哪里">2. Where a Single Memory Access Actually Goes</h2>
<p>From the CPU's point of view, one line of C code may be just:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="n">sum</span> <span class="o">+=</span> <span class="n">buffer</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
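</span></span></code></pre></div><p>But inside the SoC, this read may travel the following path:</p>
<ol>
<li>The CPU first checks the L1 data cache;</li>
<li>on an L1 miss it checks L2;</li>
<li>on an L2 miss it checks the LLC or system-level cache;</li>
<li>if that also misses, it issues a read transaction through the ACE/CHI/AXI interface;</li>
<li>the request enters the NoC or AXI interconnect and is arbitrated against other masters;</li>
<li>the DDR controller receives the request and decides which channel, rank, bank, and row to access;</li>
<li>if the target row is already open, it reads directly; otherwise it needs a precharge/activate;</li>
<li>the data comes back through the PHY and returns to the CPU along the interconnect;</li>
<li>the cache line is filled, and only then can the CPU continue executing the instructions that depend on this data.</li>
</ol>
<p>Any segment of this path can become the bottleneck. On the CPU side you see <code>cache-misses</code>, <code>stalled-cycles</code>, and falling IPC; on the interconnect side, outstanding transactions pile up and QoS latency grows; on the DDR controller side, read/write queues congest, page misses increase, and refresh cycles interfere; on the workload side, frame rate drops, inference latency jitters, and real-time threads occasionally time out.</p>
<p>Watching only one metric easily leads to a wrong diagnosis. For example, a high CPU cache miss count does not necessarily mean the DDR frequency is too low; the data structure layout may simply have poor spatial locality. High DDR bandwidth usage does not necessarily call for a higher frequency either; there may be one pointless memory copy too many.</p>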
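<p>To see how the levels of this path combine into the latency one load experiences, here is a rough expected-latency model. It is a sketch only: the function name and every latency and hit rate below are made-up illustrative values, not measurements from any platform.</p>

```python
def expected_load_latency_ns(levels, dram_ns):
    """Expected latency of one load.

    levels: (hit_latency_ns, hit_rate) pairs ordered from L1 outward; all
    numbers passed in are illustrative assumptions, not measurements.
    """
    expected = 0.0
    reach = 1.0  # probability that the request gets as far as this level
    for hit_latency_ns, hit_rate in levels:
        expected += reach * hit_rate * hit_latency_ns
        reach *= 1.0 - hit_rate
    return expected + reach * dram_ns


# Hypothetical hierarchy: L1, L2, LLC, then DRAM through the interconnect.
levels = [(1.0, 0.90), (5.0, 0.50), (20.0, 0.50)]
print(f"{expected_load_latency_ns(levels, dram_ns=120.0):.2f} ns per load")
```

<p>With these made-up numbers only 2.5% of loads reach DRAM, yet they account for about 3.0 ns of the roughly 4.65 ns average, which is why a small change in miss rate or DRAM latency shows up so clearly in IPC.</p>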
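<h2 id="三建立第一组基准连续带宽随机延迟和业务混压">3. Establish the First Baselines: Sequential Bandwidth, Random Latency, and Mixed Stress</h2>
<p>Before tuning you must establish a baseline. My habit is to run at least three kinds of tests:</p>
<h3 id="31-连续读写带宽">3.1 Sequential Read/Write Bandwidth</h3>
<p>Sequential bandwidth can be measured with STREAM, or with a simplified test of your own. The program below is less rigorous than a professional tool, but it is good for quickly checking how the result trends across buffer sizes and, by running several instances in parallel, across thread counts:</p>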
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdint.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;time.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;string.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">double</span> <span class="nf">now_sec</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">struct</span> <span class="n">timespec</span> <span class="n">ts</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">clock_gettime</span><span class="p">(</span><span class="n">CLOCK_MONOTONIC</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ts</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">ts</span><span class="p">.</span><span class="n">tv_sec</span> <span class="o">+</span> <span class="n">ts</span><span class="p">.</span><span class="n">tv_nsec</span> <span class="o">/</span> <span class="mf">1000000000.0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">size_t</span> <span class="n">mb</span> <span class="o">=</span> <span class="n">argc</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="nf">strtoull</span><span class="p">(</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o">:</span> <span class="mi">512</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">size_t</span> <span class="n">bytes</span> <span class="o">=</span> <span class="n">mb</span> <span class="o">*</span> <span class="mi">1024ULL</span> <span class="o">*</span> <span class="mi">1024ULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">src</span><span class="p">,</span> <span class="o">*</span><span class="n">dst</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="nf">posix_memalign</span><span class="p">((</span><span class="kt">void</span> <span class="o">**</span><span class="p">)</span><span class="o">&amp;</span><span class="n">src</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="n">bytes</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="nf">posix_memalign</span><span class="p">((</span><span class="kt">void</span> <span class="o">**</span><span class="p">)</span><span class="o">&amp;</span><span class="n">dst</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="n">bytes</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">memset</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="mh">0x5a</span><span class="p">,</span> <span class="n">bytes</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">memset</span><span class="p">(</span><span class="n">dst</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="n">bytes</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="kt">double</span> <span class="n">t0</span> <span class="o">=</span> <span class="nf">now_sec</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">r</span> <span class="o">&lt;</span> <span class="mi">20</span><span class="p">;</span> <span class="o">++</span><span class="n">r</span><span class="p">)</span> <span class="p">{</span>
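</span></span><span class="line"><span class="cl">        <span class="nf">memcpy</span><span class="p">(</span><span class="n">dst</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">bytes</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="cm">/* Compiler barrier (GCC/Clang): keeps -O3 from dropping the copy as a dead store. */</span>
</span></span><span class="line"><span class="cl">        <span class="n">__asm__</span> <span class="k">volatile</span><span class="p">(</span><span class="s">&#34;&#34;</span> <span class="o">:</span> <span class="o">:</span> <span class="s">&#34;r&#34;</span><span class="p">(</span><span class="n">dst</span><span class="p">)</span> <span class="o">:</span> <span class="s">&#34;memory&#34;</span><span class="p">);</span>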
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="kt">double</span> <span class="n">t1</span> <span class="o">=</span> <span class="nf">now_sec</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="kt">double</span> <span class="n">gb</span> <span class="o">=</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">bytes</span> <span class="o">*</span> <span class="mf">20.0</span> <span class="o">/</span> <span class="mi">1024</span> <span class="o">/</span> <span class="mi">1024</span> <span class="o">/</span> <span class="mi">1024</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">printf</span><span class="p">(</span><span class="s">&#34;copy bandwidth: %.2f GB/s</span><span class="se">\n</span><span class="s">&#34;</span><span class="p">,</span> <span class="n">gb</span> <span class="o">/</span> <span class="p">(</span><span class="n">t1</span> <span class="o">-</span> <span class="n">t0</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="nf">free</span><span class="p">(</span><span class="n">src</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">free</span><span class="p">(</span><span class="n">dst</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Compile and run:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">gcc -O3 -march<span class="o">=</span>native memcopy_bw.c -o memcopy_bw
</span></span><span class="line"><span class="cl">./memcopy_bw <span class="m">1024</span>
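</span></span></code></pre></div><p>Note that this result mainly reflects large contiguous copy capability, and it is also affected by the libc <code>memcpy</code> implementation, CPU prefetching, and cache policy. The printed figure counts copied bytes only (in units of 2^30 bytes); each copied byte is read once and written once, so the DDR traffic behind it is at least twice that. The test works as a health check for “is the platform in a normal state”, but it cannot represent every workload.</p>
<h3 id="32-随机访问延迟">3.2 Random Access Latency</h3>
<p>Many control-type, graph, sparse tensor, and database index workloads are latency-bound rather than bandwidth-bound. On a platform with high sequential bandwidth, the workload will still be slow if random access latency is poor. You can use <code>lat_mem_rd</code> from <code>lmbench</code>, or build your own pointer-chasing linked list test. The core idea is to make each access depend on the result of the previous read, which leaves the CPU prefetcher nothing to work with.</p>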
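<p>The dependent-access chain described above can be sketched as follows. The helper names <code>build_chase_ring</code> and <code>chase</code> are hypothetical, and the ring is built with Sattolo's algorithm so that the permutation is a single cycle covering every slot. In Python the interpreter overhead dominates the timing, so treat this as a way to check the construction before porting the same loop to C for real latency numbers.</p>

```python
import random


def build_chase_ring(n: int, seed: int = 0) -> list:
    """Sattolo's algorithm: a random permutation that forms exactly one cycle."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j is in [0, i), never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt: list, steps: int, start: int = 0) -> int:
    """Follow the ring; each index depends on the value just loaded."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx


ring = build_chase_ring(1 << 16)
print("back at start:", chase(ring, len(ring)) == 0)
```

<p>A single cycle matters because a permutation with several short cycles could keep the walk inside a subset that fits in cache, which would understate the latency.</p>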
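<h3 id="33-混压测试">3.3 Mixed Stress Test</h3>
<p>What matters most in a real SoC is mixed stress: while the CPU runs a memory test, camera capture, display refresh, NPU inference, and video encoding all run at the same time. Many problems appear only under mixed stress, because that is when the DDR controller and NoC arbitration policies are actually saturated.</p>
<p>I usually record three sets of data:</p>
<ul>
<li>sequential bandwidth and random latency with the system idle;</li>
<li>DDR counters and frame rate with the workload running alone;</li>
<li>latency jitter with the benchmark and the workload running together.</li>
</ul>
<p>If every individual test looks good but things jitter as soon as they are mixed, check QoS, DMA bursts, memory copies, and task-to-core pinning first, rather than rushing to change DDR timings.</p>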
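<p>One way to compare the idle and mixed runs is to reduce each set of latency samples to a median and a tail percentile. The sketch below uses a nearest-rank percentile; the function names and the report keys are illustrative choices, and the sample lists are assumed to come from your own measurement loop.</p>

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100) in integer math
    return ordered[max(rank, 1) - 1]


def jitter_report(idle_us, mixed_us) -> dict:
    """Summarize latency samples (microseconds) from idle and mixed-stress runs."""
    idle_p99 = percentile(idle_us, 99)
    mixed_p99 = percentile(mixed_us, 99)
    return {
        "idle_p50": percentile(idle_us, 50),
        "idle_p99": idle_p99,
        "mixed_p50": percentile(mixed_us, 50),
        "mixed_p99": mixed_p99,
        "p99_inflation": mixed_p99 / idle_p99,
    }
```

<p>A median that barely moves while <code>p99_inflation</code> grows is the typical signature of arbitration or scheduling contention rather than a plain shortage of bandwidth.</p>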
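<h2 id="四用-perf-先判断是不是-cache-问题">4. Use perf First to Decide Whether It Is a Cache Problem</h2>
<p>On Linux, the first step is usually not the DDR controller but the CPU's hardware performance counters, because many cases of “not enough memory bandwidth” are really caused by poor cache usage.</p>
<p>Commonly used command:</p>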
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">perf stat -e cycles,instructions,cache-references,cache-misses,LLC-loads,LLC-load-misses ./your_app
</span></span></code></pre></div><p>If the platform supports a fuller set of events, you can also look at:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">perf stat -e stalled-cycles-frontend,stalled-cycles-backend,branch-misses,dTLB-load-misses ./your_app
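</span></span></code></pre></div><p>A few rules of thumb:</p>
<ul>
<li><strong>Very low IPC with high backend stalls</strong>: the CPU is very likely waiting on memory or on execution unit resources.</li>
<li><strong>High LLC miss ratio</strong>: the working set exceeds the cache, or access locality is poor.</li>
<li><strong>High dTLB misses</strong>: random access over large arrays puts pressure on the page tables; consider hugepages or a better layout.</li>
<li><strong>High cache-misses but low DDR bandwidth</strong>: small random accesses may be causing a latency bottleneck rather than a bandwidth bottleneck.</li>
</ul>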
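<p>The rules of thumb above can be turned into a small first-pass classifier over raw <code>perf stat</code> counts. This is a sketch under loud assumptions: the three thresholds are hypothetical placeholders, not values from the article or from perf, and they need calibrating against a known-good run on your own platform.</p>

```python
# Hypothetical thresholds; calibrate them against a known-good run on your platform.
IPC_LOW = 0.5
BACKEND_STALL_HIGH = 0.5
LLC_MISS_HIGH = 0.3


def summarize(counters: dict) -> dict:
    """Derive ratios from raw `perf stat` counts and attach first-pass hints."""
    cycles = counters["cycles"]
    ipc = counters["instructions"] / cycles
    backend = counters.get("stalled-cycles-backend", 0) / cycles
    llc_loads = counters.get("LLC-loads", 0)
    llc_miss = counters.get("LLC-load-misses", 0) / llc_loads if llc_loads else 0.0

    hints = []
    if ipc < IPC_LOW and backend > BACKEND_STALL_HIGH:
        hints.append("likely waiting on memory or execution resources")
    if llc_miss > LLC_MISS_HIGH:
        hints.append("working set exceeds cache or locality is poor")
    return {"ipc": ipc, "backend_stall": backend, "llc_miss": llc_miss, "hints": hints}
```

<p>Events that the platform does not expose are treated as zero here, so a missing hint means “no evidence”, not “no problem”.</p>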
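<p>A very typical example is the difference between AoS (Array of Structs) and SoA (Struct of Arrays). Suppose we only need to process the pixel luma <code>y</code>, but the data structure mixes several fields together:</p>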
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="n">y</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="n">u</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="n">v</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="n">flag</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">timestamp</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="n">PixelMeta</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">uint64_t</span> <span class="nf">sum_y_aos</span><span class="p">(</span><span class="n">PixelMeta</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">s</span> <span class="o">+=</span> <span class="n">p</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">y</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">s</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
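</span></span></code></pre></div><p><code>PixelMeta</code> occupies 8 bytes, so only 1 of every 8 bytes in each cache line the CPU pulls in is used by this loop; the rest are fields it does not need. After switching to SoA, the accesses become far more contiguous:</p>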
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">y</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">u</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">v</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">flag</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">timestamp</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="n">PixelPlane</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">uint64_t</span> <span class="nf">sum_y_soa</span><span class="p">(</span><span class="n">PixelPlane</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">s</span> <span class="o">+=</span> <span class="n">p</span><span class="o">-&gt;</span><span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">s</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
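</span></span></code></pre></div><p>This kind of optimization changes neither the DDR frequency nor the kernel, yet it can significantly reduce wasted bandwidth. In image processing, sensor fusion, and inference pre- and post-processing especially, data layout often matters more than simply “adding threads”.</p>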
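<p>The saving can be estimated before touching any C code. The sketch below mirrors the <code>PixelMeta</code> layout with <code>ctypes</code> and counts the cache lines a luma-only pass touches; the helper name <code>bytes_streamed</code>, the 64-byte line, and the 1920x1080 frame are assumptions made for illustration.</p>

```python
import ctypes


class PixelMeta(ctypes.Structure):
    """Same field order as the AoS struct in the C example."""
    _fields_ = [
        ("y", ctypes.c_uint8),
        ("u", ctypes.c_uint8),
        ("v", ctypes.c_uint8),
        ("flag", ctypes.c_uint8),
        ("timestamp", ctypes.c_uint32),
    ]


def bytes_streamed(n_items: int, stride_bytes: int, line_bytes: int = 64) -> int:
    """Cache-line bytes touched when reading one field every stride_bytes.

    Valid while stride_bytes is no larger than line_bytes, so every line in the
    span is touched. The 64-byte line is an assumption.
    """
    span = n_items * stride_bytes
    return -(-span // line_bytes) * line_bytes  # round up to whole lines


n = 1920 * 1080  # hypothetical frame size
aos = bytes_streamed(n, ctypes.sizeof(PixelMeta))
soa = bytes_streamed(n, 1)
print(f"AoS: {aos} bytes, SoA: {soa} bytes, ratio {aos / soa:.1f}x")
```

<p>The ratio is simply the struct size divided by the size of the field the loop reads, so wider metadata makes the AoS penalty proportionally worse.</p>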
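<h2 id="五axinoc-层别让-master-互相踩脚">5. AXI/NoC Layer: Keep Masters from Stepping on Each Other</h2>
<p>DDR in an SoC is not exclusive to the CPU. The camera ISP may continuously write frame buffers, the display controller periodically reads the framebuffer, the NPU reads weights and feature maps, and the VPU encoder reads and writes bitstreams and reference frames. They usually all enter the memory system through AXI or the on-chip NoC.</p>
<p>Common tuning points at the AXI layer include:</p>
<h3 id="51-burst-长度">5.1 Burst Length</h3>
<p>DDR favors contiguous access, and AXI bursts that are too short add command overhead. For DMA, keep transfer addresses contiguous, lengths aligned, and bursts long enough. For example, if an image row stride is not aligned to 64/128 bytes, the DMA may be forced to split each row into more transactions.</p>
<p>When allocating DMA buffers in a driver, confirm at least:</p>
<ul>
<li>whether the start address meets the hardware alignment;</li>
<li>whether the per-row stride meets the module's requirements;</li>
<li>whether the buffer crosses a boundary the hardware does not support;</li>
<li>whether cache sync is causing extra copies or flushes.</li>
</ul>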
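<p>The effect of stride alignment can be sketched by counting how many burst-sized blocks one image row touches. The helpers <code>align_up</code> and <code>bursts_per_row</code> are hypothetical, the 1918-byte row and 64-byte burst are example values, and the model assumes the DMA splits transfers at burst-aligned boundaries, which may not match your hardware.</p>

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return -(-value // alignment) * alignment


def bursts_per_row(start_addr: int, row_bytes: int, burst_bytes: int) -> int:
    """Burst-aligned blocks touched by a row of row_bytes starting at start_addr."""
    first = start_addr // burst_bytes
    last = (start_addr + row_bytes - 1) // burst_bytes
    return last - first + 1


row_bytes = 1918  # hypothetical row payload, e.g. a cropped 8-bit luma plane
packed_stride = row_bytes
padded_stride = align_up(row_bytes, 64)

# Compare the second row (index 1) under each stride.
print("packed:", bursts_per_row(1 * packed_stride, row_bytes, 64))
print("padded:", bursts_per_row(1 * padded_stride, row_bytes, 64))
```

<p>Padding the stride costs a few bytes per row but keeps every row starting on a burst boundary, so no row pays for an extra partial burst at its head.</p>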
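<h3 id="52-outstanding-能力">5.2 Outstanding Capability</h3>
<p>An AXI master can keep multiple incomplete transactions in flight. If the outstanding depth is too small, latency cannot be hidden; if it is too large, it may squeeze other real-time masters. Throughput-oriented devices such as NPUs and GPUs usually need a large outstanding depth, while display, audio, and some real-time capture paths care more about the latency upper bound.</p>
<p>If the chip manual offers outstanding configuration for the NoC or DDR ports, do not blindly max it out; test by workload group instead:</p>
<ul>
<li>NPU inference throughput alone;</li>
<li>NPU + ISP;</li>
<li>NPU + ISP + VPU;</li>
<li>with the CPU post-processing threads added.</li>
</ul>
<p>What you watch is overall frame rate and P99 latency, not the peak of any single module.</p>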
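<p>A starting point for those experiments comes from Little's law: the bytes that must be in flight equal the target bandwidth times the round-trip latency. The sketch below uses hypothetical function names and example numbers; the latency should be the one you measure under load, not an idle figure.</p>

```python
import math


def outstanding_needed(target_gb_s: float, latency_ns: float, burst_bytes: int) -> int:
    """Transactions that must be in flight to sustain target_gb_s.

    Little's law: bytes in flight = bandwidth x latency, and 1 GB/s is 1 byte/ns.
    """
    bytes_in_flight = target_gb_s * latency_ns
    return math.ceil(bytes_in_flight / burst_bytes)


def bandwidth_ceiling_gb_s(outstanding: int, latency_ns: float, burst_bytes: int) -> float:
    """Upper bound on one master's bandwidth for a given outstanding depth."""
    return outstanding * burst_bytes / latency_ns


# Example values only: a 4 GB/s target, 200 ns loaded latency, 64-byte bursts.
print(outstanding_needed(4, 200, 64))
print(bandwidth_ceiling_gb_s(8, 200, 64))
```

<p>The same relation explains why a master's throughput falls under mixed stress even when its configuration is unchanged: latency rises, so a fixed outstanding depth supports less bandwidth.</p>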
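<h3 id="53-qos-优先级">5.3 QoS Priority</h3>
<p>Many SoCs have a QoS field on the AXI port or internal arbitration weights. Real-time streams such as the display controller, audio, and camera input cause screen corruption, audio pops, or dropped frames as soon as they starve, whereas slower AI inference usually just means higher latency. The goal of QoS is therefore not to make all modules “fair” but to give real-time paths determinism and let throughput-oriented modules consume the remaining bandwidth.</p>
<p>A practical strategy is:</p>
<ol>
<li>first make sure the display/capture paths never drop;</li>
<li>then give the encoder and NPU medium priority;</li>
<li>lower the priority of CPU background tasks, logging, and file I/O;</li>
<li>use large bursts for throughput-oriented DMA but limit its outstanding depth, so it does not hold the channel for long periods.</li>
</ol>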
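<h2 id="六ddr-controllerrow-hit读写切换和刷新">6. DDR Controller: Row Hits, Read/Write Turnaround, and Refresh</h2>
<p>At the DDR controller layer the problem becomes more “hardware”. The core question is how the scheduler turns requests from the upper layers into a sequence of DRAM commands.</p>
<p>DRAM is organized internally by bank, row, and column. Accessing a row that is already open is a row hit; accessing a different row in the same bank requires precharging the current row and activating the new one, which costs noticeably more. Contiguous access, row-wise access, and fewer random jumps between rows, especially within the same bank, therefore usually improve efficiency.</p>
<p>The controller also has to handle read/write switching. Reads and writes drive the bus in opposite directions, and frequent switching incurs a turnaround penalty. Many controllers prefer to batch a group of reads or a group of writes before switching to improve bus efficiency; but if they batch for too long, the latency of real-time writes or reads can get worse.</p>
<p>Refresh is another factor that cannot be ignored. DRAM needs periodic refresh, and the higher the temperature, the greater the refresh pressure may be. Some projects see periodic latency spikes in a high-temperature chamber and eventually find that the cause is not CPU scheduling but memory refresh coinciding with workload peaks.</p>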
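<p>Two of these effects are easy to put rough numbers on: the share of time lost to refresh, and the average cost of an access as the row hit rate changes. This is a back-of-the-envelope sketch; the function names are hypothetical and every timing value below is a placeholder, so take tRFC, tREFI, and the hit and miss costs from your DRAM datasheet and controller documentation.</p>

```python
def refresh_overhead(t_rfc_ns: float, t_refi_ns: float) -> float:
    """Fraction of time the DRAM is unavailable because it is refreshing."""
    return t_rfc_ns / t_refi_ns


def average_access_cycles(row_hit_rate: float, hit_cycles: float, miss_cycles: float) -> float:
    """Mean command cost when a fraction row_hit_rate of accesses hit an open row."""
    return row_hit_rate * hit_cycles + (1.0 - row_hit_rate) * miss_cycles


# Placeholder timings, not taken from any specific part.
T_RFC_NS = 280.0
T_REFI_NS = 3904.0

print(f"refresh overhead, normal interval: {refresh_overhead(T_RFC_NS, T_REFI_NS):.1%}")
print(f"refresh overhead, halved interval: {refresh_overhead(T_RFC_NS, T_REFI_NS / 2):.1%}")
print(f"avg cycles at 90% hits: {average_access_cycles(0.9, 10, 40):.1f}")
print(f"avg cycles at 50% hits: {average_access_cycles(0.5, 10, 40):.1f}")
```

<p>Halving the refresh interval, as some devices do at high temperature, doubles the refresh share in this model, which is consistent with latency spikes that only show up in the thermal chamber.</p>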
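<p>If the platform exposes DDR controller counters, watch:</p>
<ul>
<li>read/write command counts;</li>
<li>row hit / row miss;</li>
<li>bank conflicts;</li>
<li>refresh cycles;</li>
<li>port busy or queue full;</li>
<li>each master's share of bandwidth.</li>
</ul>
<p>Interfaces differ widely between vendors: some are under <code>/sys</code>, some go through <code>devfreq</code>, and some require reading registers. Do not look at a single total bandwidth number; break it down by master or port if you can, otherwise it is hard to know who is consuming the bandwidth.</p>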
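<h2 id="七linux-侧常见的隐藏带宽消耗">7. Common Hidden Bandwidth Consumers on the Linux Side</h2>
<p>At the application layer, what wastes the most DDR bandwidth is usually not computation but moving data around. This is especially visible in video and AI pipelines:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Camera -&gt; ISP -&gt; memory -&gt; CPU memcpy -&gt; NPU input -&gt; memory -&gt; CPU post-processing -&gt; VPU/Display
</span></span></code></pre></div><p>If every arrow lands in DDR and every module boundary adds a format conversion, bandwidth is used up quickly. Optimization directions include:</p>
<ul>
<li>use DMA-BUF to share buffers between modules;</li>
<li>let the ISP/RGA/NPU/VPU work directly on physically contiguous or IOMMU-mapped buffers wherever possible;</li>
<li>reduce CPU involvement in large image copies;</li>
<li>standardize the color format to avoid converting back and forth between NV12/RGB/BGR;</li>
<li>use an appropriate cache policy for read-only weights and lookup tables;</li>
<li>avoid unnecessary cache invalidate/clean on one-shot DMA buffers.</li>
</ul>
<p>Often, removing two <code>memcpy</code> calls is more effective, and more power-efficient, than raising the DDR data rate from 3200 to 4266 MT/s.</p>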
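<p>The cost of those extra copies can be estimated per pipeline. The sketch below counts every full-frame read or write as one DDR pass; the function names, the 1920x1080 NV12 frame, 30 fps, and 4 streams are assumptions chosen for illustration, not figures from a measured system.</p>

```python
def nv12_frame_bytes(width: int, height: int) -> int:
    """NV12 stores a full-size Y plane plus a half-size interleaved UV plane."""
    return width * height * 3 // 2


def pipeline_traffic_mb_s(width: int, height: int, fps: int, streams: int, ddr_passes: int) -> float:
    """DDR traffic in MB/s; ddr_passes counts each full-frame read or write once."""
    return nv12_frame_bytes(width, height) * fps * streams * ddr_passes / 1e6


# Hypothetical 4-camera 1080p30 pipeline.
# Zero-copy: ISP write + NPU read = 2 passes.
# With one CPU memcpy in between: 2 more passes (one read, one write).
zero_copy = pipeline_traffic_mb_s(1920, 1080, 30, 4, ddr_passes=2)
with_memcpy = pipeline_traffic_mb_s(1920, 1080, 30, 4, ddr_passes=4)
print(f"zero-copy: {zero_copy:.1f} MB/s, with memcpy: {with_memcpy:.1f} MB/s")
```

<p>Each <code>memcpy</code> adds two passes, so in this model a single copy between the ISP and the NPU doubles the traffic of that stage.</p>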
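<h2 id="八一个可复用的带宽排查脚本">8. A Reusable Bandwidth Diagnosis Script</h2>
<p>The Python script below automatically runs the copy test at different buffer sizes and records key <code>perf stat</code> metrics. In real projects I keep it in the bring-up toolbox and run it after every change to the DDR frequency, kernel version, or driver DMA policy.</p>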
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="ch">#!/usr/bin/env python3</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">csv</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">re</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">subprocess</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">SIZES_MB</span> <span class="o">=</span> <span class="p">[</span><span class="mi">64</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">1024</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="n">EVENTS</span> <span class="o">=</span> <span class="s2">&#34;cycles,instructions,cache-references,cache-misses,LLC-loads,LLC-load-misses&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">BIN</span> <span class="o">=</span> <span class="s2">&#34;./memcopy_bw&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">perf_re</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="sa">r</span><span class="s2">&#34;^\s*([0-9,]+)\s+([A-Za-z0-9_-]+)&#34;</span><span class="p">,</span> <span class="n">re</span><span class="o">.</span><span class="n">M</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">bw_re</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="sa">r</span><span class="s2">&#34;copy bandwidth:\s+([0-9.]+)\s+GB/s&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">run_one</span><span class="p">(</span><span class="n">size_mb</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">cmd</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;perf&#34;</span><span class="p">,</span> <span class="s2">&#34;stat&#34;</span><span class="p">,</span> <span class="s2">&#34;-e&#34;</span><span class="p">,</span> <span class="n">EVENTS</span><span class="p">,</span> <span class="n">BIN</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">size_mb</span><span class="p">)]</span>
</span></span><span class="line"><span class="cl">    <span class="n">p</span> <span class="o">=</span> <span class="n">subprocess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">stdout</span><span class="o">=</span><span class="n">subprocess</span><span class="o">.</span><span class="n">PIPE</span><span class="p">,</span> <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="o">.</span><span class="n">PIPE</span><span class="p">,</span> <span class="n">check</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">out</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">stdout</span> <span class="o">+</span> <span class="s2">&#34;</span><span class="se">\n</span><span class="s2">&#34;</span> <span class="o">+</span> <span class="n">p</span><span class="o">.</span><span class="n">stderr</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">row</span> <span class="o">=</span> <span class="p">{</span><span class="s2">&#34;size_mb&#34;</span><span class="p">:</span> <span class="n">size_mb</span><span class="p">,</span> <span class="s2">&#34;bandwidth_gbps&#34;</span><span class="p">:</span> <span class="kc">None</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="n">m</span> <span class="o">=</span> <span class="n">bw_re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">m</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">row</span><span class="p">[</span><span class="s2">&#34;bandwidth_gbps&#34;</span><span class="p">]</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">value</span><span class="p">,</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">perf_re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">out</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">row</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">value</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">&#34;,&#34;</span><span class="p">,</span> <span class="s2">&#34;&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">row</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">rows</span> <span class="o">=</span> <span class="p">[</span><span class="n">run_one</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">SIZES_MB</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="n">keys</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">({</span><span class="n">k</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">rows</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">r</span><span class="o">.</span><span class="n">keys</span><span class="p">()})</span>
</span></span><span class="line"><span class="cl"><span class="n">Path</span><span class="p">(</span><span class="s2">&#34;results&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&#34;results/memory_perf.csv&#34;</span><span class="p">,</span> <span class="s2">&#34;w&#34;</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s2">&#34;&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">writer</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictWriter</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">fieldnames</span><span class="o">=</span><span class="n">keys</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">writer</span><span class="o">.</span><span class="n">writeheader</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">writer</span><span class="o">.</span><span class="n">writerows</span><span class="p">(</span><span class="n">rows</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">rows</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">miss</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;cache-misses&#34;</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">ref</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;cache-references&#34;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">miss_rate</span> <span class="o">=</span> <span class="n">miss</span> <span class="o">/</span> <span class="n">ref</span> <span class="o">*</span> <span class="mi">100</span> <span class="k">if</span> <span class="n">ref</span> <span class="k">else</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">r</span><span class="p">[</span><span class="s1">&#39;size_mb&#39;</span><span class="p">]</span><span class="si">:</span><span class="s2">4d</span><span class="si">}</span><span class="s2"> MB  </span><span class="si">{</span><span class="n">r</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;bandwidth_gbps&#39;</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="si">:</span><span class="s2">6.2f</span><span class="si">}</span><span class="s2"> GB/s  cache miss </span><span class="si">{</span><span class="n">miss_rate</span><span class="si">:</span><span class="s2">5.2f</span><span class="si">}</span><span class="s2">%&#34;</span><span class="p">)</span>
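</span></span></code></pre></div><p>This script is no substitute for a proper benchmark suite, but it has two benefits. First, every optimization leaves a data record. Second, it quickly exposes cases where a seemingly unrelated change altered memory behavior. For example, if a driver change switches a buffer from cacheable to non-cacheable, CPU post-processing performance drops immediately; if a device tree change selects the wrong DDR devfreq operating point, sequential bandwidth shifts noticeably as well.</p>
<h2 id="九实战调优顺序不要一上来就改-ddr-参数">9. Tuning order in practice: don't start by changing DDR parameters</h2>
<p>DDR parameters are tempting because they look closest to the bottleneck. But in a production project, casually changing DDR training, ODT, timings, or frequency carries far more risk than benefit. The order I recommend is:</p>
<h3 id="91-先确认频率和工作模式">9.1 First confirm the frequency and operating mode</h3>
<p>Check whether DDR is running at the expected frequency, whether devfreq is being held down by a power-saving policy, whether both channels are enabled, and whether the bus width matches the hardware design. Many “performance problems” turn out to be nothing more than a wrong device tree frequency point, bootloader initialization, or power mode configuration.</p>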
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">cat /sys/class/devfreq/*/cur_freq 2&gt;/dev/null
</span></span><span class="line"><span class="cl">cat /sys/class/devfreq/*/available_frequencies 2&gt;/dev/null
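</span></span></code></pre></div><p>Node names differ from platform to platform, so the commands above are only illustrative. What matters is recording the frequency in three states: idle, running the real workload, and under mixed stress.</p>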
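<p>To make that record repeatable, the same sysfs reads can be wrapped in a small helper. The sketch below is a minimal example rather than a finished tool: the function name <code>read_devfreq</code> is made up for illustration, and it assumes the platform exposes the standard devfreq attributes <code>cur_freq</code>, <code>governor</code>, and <code>available_frequencies</code>. Attributes that are missing on a given board are skipped.</p>

```python
from pathlib import Path


def read_devfreq(root='/sys/class/devfreq'):
    """Return {node_name: {attribute: text}} for every devfreq node under root."""
    snapshot = {}
    base = Path(root)
    if not base.is_dir():
        return snapshot  # no devfreq support, or a different sysfs layout
    for node in sorted(base.iterdir()):
        attrs = {}
        for name in ('cur_freq', 'governor', 'available_frequencies'):
            try:
                attrs[name] = (node / name).read_text().strip()
            except OSError:
                continue  # attribute missing or unreadable on this platform
        if attrs:
            snapshot[node.name] = attrs
    return snapshot


if __name__ == '__main__':
    # Run once at idle, once under the workload, once under mixed stress.
    for name, attrs in read_devfreq().items():
        print(name, attrs)
```

<p>Calling it once per state and saving the output next to the benchmark CSV makes it easy to spot a run that was measured at the wrong frequency.</p>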
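<h3 id="92-再减少无效流量">9.2 Next, cut wasted traffic</h3>
<p>Start by removing duplicate copies, back-and-forth format conversions, high-volume log writes to storage, debug overlays, and pointless buffer clearing. In imaging projects especially, <code>memset</code> and <code>memcpy</code> often hide inside innocuous-looking wrapper functions.</p>
<p>You can temporarily wrap <code>memcpy</code> with <code>LD_PRELOAD</code> to collect statistics, or add trace points around large copies in the code. Don't judge by feel: many teams are surprised to find that the CPU moves several times more data per frame than the size of the raw image.</p>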
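<p>Once the copies have been counted, the amplification is simple arithmetic. The sketch below assumes a 1920x1080 NV12 frame and a hypothetical per-frame list of operations; the helper names and the operation list are illustrative, not taken from a real pipeline. It also treats every copied byte as one read plus one write, so it is an upper-bound estimate that ignores data already resident in cache.</p>

```python
def frame_bytes_nv12(width, height):
    # NV12: full-resolution Y plane plus a half-size interleaved UV plane.
    return width * height * 3 // 2


def cpu_traffic(ops):
    """ops: list of (kind, nbytes), where kind is 'memcpy' or 'memset'."""
    cost = {'memcpy': 2, 'memset': 1}  # memcpy reads and writes every byte
    return sum(cost[kind] * nbytes for kind, nbytes in ops)


raw = frame_bytes_nv12(1920, 1080)
# Hypothetical per-frame operations found by tracing.
ops = [('memcpy', raw), ('memset', raw), ('memcpy', raw)]
total = cpu_traffic(ops)
print(f'raw frame {raw} B, CPU traffic {total} B, amplification {total / raw:.1f}x')
```

<p>With these assumed operations, two copies and one clear already move five times the raw frame size through the memory system.</p>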
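<h3 id="93-然后优化访问局部性">9.3 Then improve access locality</h3>
<p>This covers converting data structures from AoS to SoA, reordering loops, tile/block processing, prefetching, alignment, and reducing random access. In matrix and image algorithms, blocking is often very effective: it keeps the working set in L1/L2 instead of going back to DDR at every step.</p>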
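<p>The blocking idea can be sketched with a tiled transpose. The example below is illustrative only: the function name and the default tile size of 32 are assumptions, and pure Python will not show the cache benefit. It is meant to show the traversal order that a C or SIMD implementation would follow.</p>

```python
def transpose_tiled(src, rows, cols, tile=32):
    """Transpose a row-major rows x cols flat list, one tile at a time."""
    dst = [0] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # Within a tile, reads from src and writes to dst both stay
            # inside a small address range, so the working set is bounded.
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    dst[c * rows + r] = src[r * cols + c]
    return dst
```

<p>In a native implementation, the tile size would be chosen so that one source tile plus one destination tile fit in the target cache level.</p>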
<p>A simple 2D array traversal example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Bad: column-major traversal, large stride between accesses, poor cache utilization
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">width</span><span class="p">;</span> <span class="o">++</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">height</span><span class="p">;</span> <span class="o">++</span><span class="n">y</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">acc</span> <span class="o">+=</span> <span class="n">img</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="n">x</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Good: row-major traversal, contiguous reads
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">height</span><span class="p">;</span> <span class="o">++</span><span class="n">y</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">row</span> <span class="o">=</span> <span class="n">img</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">width</span><span class="p">;</span> <span class="o">++</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">acc</span> <span class="o">+=</span> <span class="n">row</span><span class="p">[</span><span class="n">x</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
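</span></span></code></pre></div><h3 id="94-最后再碰-qos-和-ddr-控制器">9.4 Only then touch QoS and the DDR controller</h3>
<p>Once you have confirmed that wasted traffic is down and access patterns are reasonable, yet performance still jitters under mixed stress, move on to tuning QoS, outstanding transactions, read/write queues, watermarks, and the DDR devfreq governor. Change one variable at a time, and record the mean, P95, and P99 rather than only the peak.</p>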
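<p>A minimal sketch of that record, assuming per-frame latencies have already been collected in milliseconds. The helper names are illustrative, and the percentile uses the nearest-rank definition; other definitions interpolate and may give slightly different values.</p>

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples):
    return {
        'mean': sum(samples) / len(samples),
        'p95': percentile(samples, 95),
        'p99': percentile(samples, 99),
        'max': max(samples),
    }


# Illustrative data: 97 frames at 10 ms and 3 frames at 40 ms.
frame_ms = [10.0] * 97 + [40.0] * 3
print(summarize(frame_ms))
```

<p>In this made-up data set the mean barely moves and P95 stays at 10 ms, while P99 jumps to 40 ms. That is exactly the kind of tail a mean-only report hides.</p>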
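<h2 id="十几个常见故障现象和定位方向">10. Common failure symptoms and where to look</h2>
<h3 id="101-单测很快业务很慢">10.1 Standalone tests are fast, the real workload is slow</h3>
<p>Suspect multi-master contention, extra copies, and cache maintenance first. A sequential memory test cannot reproduce the read/write mix and real-time constraints of the real workload, so mixed stress testing is required.</p>
<h3 id="102-平均帧率够偶发掉帧">10.2 Average frame rate is fine, but frames drop occasionally</h3>
<p>Look at P99 latency, scheduler preemption, DDR refresh, QoS, and thermal throttling. For display, camera, and audio problems in particular, focus on the worst case rather than the average.</p>
<h3 id="103-cpu-占用不高但程序慢">10.3 CPU usage is low but the program is slow</h3>
<p>The likely cause is backend stalls: the threads are waiting on memory. Use <code>perf stat</code> to check IPC, cache misses, and dTLB misses, then use a flame graph to see whether the hot spots are random accesses into large arrays.</p>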
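<p>The ratios worth watching can be derived from raw counter values. The sketch below assumes the counters have already been parsed into a dict keyed by perf event name (<code>cycles</code>, <code>instructions</code>, <code>cache-references</code>, <code>cache-misses</code>, <code>dTLB-load-misses</code>). The function name <code>derive</code> and the sample numbers are illustrative, not a measurement, and which events are available depends on the CPU's PMU.</p>

```python
def derive(counters):
    """counters: dict mapping perf event name to its count."""
    instructions = counters.get('instructions', 0)
    cycles = counters.get('cycles', 0)
    refs = counters.get('cache-references', 0)
    misses = counters.get('cache-misses', 0)
    dtlb = counters.get('dTLB-load-misses', 0)
    return {
        # Instructions retired per cycle.
        'ipc': instructions / cycles if cycles else 0.0,
        # Share of cache references that missed, in percent.
        'cache_miss_pct': 100 * misses / refs if refs else 0.0,
        # dTLB load misses per thousand instructions.
        'dtlb_mpki': 1000 * dtlb / instructions if instructions else 0.0,
    }


# Illustrative numbers only.
sample = {
    'cycles': 4_000_000_000,
    'instructions': 1_600_000_000,
    'cache-references': 200_000_000,
    'cache-misses': 60_000_000,
    'dTLB-load-misses': 8_000_000,
}
print(derive(sample))
```

<p>Tracking these ratios before and after each change is usually more informative than CPU utilization, which stays low while a thread is stalled on memory.</p>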
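<h3 id="104-npu-利用率上不去">10.4 NPU utilization won't go up</h3>
<p>The cause is not necessarily a small model. It may also be input preprocessing, tensor layout conversion, weight reads, or stalls caused by the NPU sharing DDR with the CPU/GPU. Check whether zero-copy input is supported and whether an unnecessary NHWC/NCHW conversion runs on every frame.</p>
<h3 id="105-高温后性能下降">10.5 Performance drops at high temperature</h3>
<p>Check DDR devfreq, CPU/GPU/NPU throttling, DRAM refresh, and PMIC current limiting. For thermal problems, don't watch only the CPU temperature; memory and power policies also affect throughput.</p>
<h2 id="十一量产项目里的建议清单">11. A checklist for production projects</h2>
<p>Finally, here is the checklist I usually use in project reviews:</p>
<ul>
<li>DDR frequency, bus width, and channel count match the hardware design;</li>
<li>the DDR/devfreq configuration in the bootloader is consistent with the one in the kernel;</li>
<li>baseline data exists for all three cases: STREAM, random-access latency, and mixed workload stress;</li>
<li>CPU PMU, DDR controller, and NoC/AXI port counters are recorded;</li>
<li>the video/AI pipeline achieves zero-copy through DMA-BUF or an equivalent mechanism;</li>
<li>large buffers are aligned, and the stride suits the DMA burst size;</li>
<li>large <code>memcpy</code>, <code>memset</code>, and format conversions on the CPU have been identified;</li>
<li>hot data structures have good spatial locality;</li>
<li>real-time masters have higher QoS than throughput-oriented background tasks;</li>
<li>evaluation uses P95/P99 latency rather than only average throughput;</li>
<li>verification has been repeated at high temperature, at low voltage, and in power-saving modes.</li>
</ul>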
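<p>The alignment item in the checklist is easy to verify mechanically. The sketch below rounds a row stride up to a burst-friendly boundary. The 64-byte alignment is an assumption chosen for illustration; the real requirement comes from the DMA engine and interconnect documentation of the specific SoC.</p>

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def padded_stride(width_px, bytes_per_px, alignment=64):
    # Stride in bytes for one image row, padded to the assumed alignment.
    return align_up(width_px * bytes_per_px, alignment)


# A 1918-pixel-wide 8-bit plane does not land on a 64-byte boundary.
print(padded_stride(1918, 1), padded_stride(1920, 1))
```

<p>A stride that is already aligned comes back unchanged, so the helper can be applied unconditionally when allocating buffers.</p>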
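<h2 id="总结">Summary</h2>
<p>DDR bandwidth tuning is not a single-point optimization but a systems engineering effort across an entire path. The CPU sees cache misses; DMA sees bursts and alignment; the NoC sees arbitration and outstanding transactions; the DDR controller sees row hits, read/write turnarounds, and refresh; and the application ultimately sees frame rate, latency, and stability.</p>
<p>An effective tuning order is: establish a baseline, then confirm the frequency and hardware configuration; cut wasted copies, then improve data locality; locate the bottleneck with perf and application traces, then adjust QoS, the NoC, and the DDR controller. Unless you already have solid data showing that the bottleneck is in the memory controller, don't start by changing DDR timings.</p>
<p>If you remember only one sentence, make it this: <strong>the goal of memory system optimization is not the highest GB/s figure, but delivering useful data stably and predictably to the compute units that need it under a real mixed workload.</strong> This is also the most interesting boundary between chip architecture and embedded software: the hardware sets the ceiling, and the software decides how close you get to it.</p>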
]]></content:encoded>
    </item>
  </channel>
</rss>
