<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Cache on Tech Snippets - 嵌入式技术笔记</title>
    <link>https://tech-snippets.xyz/tags/cache/</link>
    <description>Recent content in Cache on Tech Snippets - 嵌入式技术笔记</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Mon, 08 Jun 2026 19:00:00 +0800</lastBuildDate>
    <atom:link href="https://tech-snippets.xyz/tags/cache/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Cortex-M Cache, MPU, and DMA Coherency in Practice: Making High-Performance MCUs Such as the STM32H7 Run Stably and Fast</title>
      <link>https://tech-snippets.xyz/posts/arm-cortex-m-cache-mpu-dma-coherency-guide/</link>
      <pubDate>Mon, 08 Jun 2026 19:00:00 +0800</pubDate>
      <guid>https://tech-snippets.xyz/posts/arm-cortex-m-cache-mpu-dma-coherency-guide/</guid>
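      <description>Preface: the best-hidden trap on a high-performance MCU is not insufficient compute but data "incoherency". For many people, the first move from Cortex-M3 / Cortex-M4 to Cortex-M7 makes an immediate impression: a higher clock, a stronger FPU, more on-chip SRAM, and more peripheral bandwidth. On parts such as the STM32H7, the NXP i.MX RT, and some Chinese-made high-performance MCUs, the system now includes I-Cache, D-Cache, AXI SRAM, a multi-layer bus matrix, MDMA, ETH, SDMMC, DCMI, LTDC, and other blocks that rarely needed careful handling on small MCUs. The code is still C, the peripherals still use DMA, and the debugger can still single-step, but once the project involves image capture, Ethernet, file systems, audio streaming, or display refresh, the failures become strange:
DMA has clearly finished writing the buffer, yet the CPU still reads old data; the CPU has clearly filled in the transmit packet, yet the Ethernet DMA sends the previous frame; the system is stable with D-Cache turned off, but throughput drops sharply; adding one call to SCB_CleanDCache_by_Addr() helps sometimes and not others; the same code works in a Debug build but fails in a Release build or at a different optimization level; when the buffer length is not a multiple of 32 bytes, neighboring variables get corrupted for no apparent reason. The root cause of these symptoms is usually neither a buggy peripheral driver nor compiler "black magic". It is that the CPU, the cache, the MPU, and the DMA disagree about the contents of the same memory. The Cortex-M7 D-Cache speeds up CPU accesses, but a DMA controller normally bypasses the D-Cache and reads and writes SRAM or external RAM directly. So for the same address, the CPU may see new data held in a cache line while DMA sees old data in memory; conversely, after DMA has written new data to memory, the CPU may still hit a stale cache line.</description>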
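      <content:encoded><![CDATA[<h2 id="前言高性能-mcu-最隐蔽的坑不是算力不够而是数据不一致">Preface: The Best-Hidden Trap on a High-Performance MCU Is Not Insufficient Compute but Data "Incoherency"</h2>
<p>For many people, the first move from Cortex-M3 / Cortex-M4 to Cortex-M7 makes an immediate impression: a higher clock, a stronger FPU, more on-chip SRAM, and more peripheral bandwidth. On parts such as the STM32H7, the NXP i.MX RT, and some Chinese-made high-performance MCUs, the system now includes I-Cache, D-Cache, AXI SRAM, a multi-layer bus matrix, MDMA, ETH, SDMMC, DCMI, LTDC, and other blocks that rarely needed careful handling on small MCUs. The code is still C, the peripherals still use DMA, and the debugger can still single-step, but once the project involves image capture, Ethernet, file systems, audio streaming, or display refresh, the failures become strange:</p>
<ul>
<li>DMA has clearly finished writing the buffer, yet the CPU still reads old data;</li>
<li>the CPU has clearly filled in the transmit packet, yet the Ethernet DMA sends the previous frame;</li>
<li>the system is stable with D-Cache turned off, but throughput drops sharply;</li>
<li>adding one call to <code>SCB_CleanDCache_by_Addr()</code> helps sometimes and not others;</li>
<li>the same code works in a Debug build but fails in a Release build or at a different optimization level;</li>
<li>when the buffer length is not a multiple of 32 bytes, neighboring variables get corrupted for no apparent reason.</li>
</ul>
<p>The root cause of these symptoms is usually neither a buggy peripheral driver nor compiler "black magic". It is that the CPU, the cache, the MPU, and the DMA disagree about the contents of the same memory. The Cortex-M7 D-Cache speeds up CPU accesses, but a DMA controller normally bypasses the D-Cache and reads and writes SRAM or external RAM directly. So for the same address, the CPU may see new data held in a cache line while DMA sees old data in memory; conversely, after DMA has written new data to memory, the CPU may still hit a stale cache line.</p>
<p>This article does not translate the register manual bit by bit. It takes an engineering view and answers three questions: first, why cache, MPU, and DMA affect one another; second, how to design a maintainable memory-region and buffer strategy; third, how to write code that runs stably over the long term in networking, camera, SD card, and display-refresh scenarios. The examples lean toward STM32H7 / Cortex-M7, but the method applies equally to other high-performance MCUs with a D-Cache, an MPU, and DMA.</p>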
<p><img alt="Relationship between Cortex-M Cache, MPU, and DMA coherency" loading="lazy" src="/images/arm-cortex-m-cache-mpu-dma-coherency.svg"></p>
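<h2 id="一先建立一个基本模型cpu-看-cachedma-看内存">1. Start with a Basic Model: The CPU Sees the Cache, DMA Sees Memory</h2>
<p>In traditional Cortex-M0 / M3 / M4 projects we tend to bind "address" and "data" together: a pointer refers to SRAM, whatever the CPU writes is what DMA reads, and whatever DMA writes is what the CPU reads next. That model largely holds on systems without a D-Cache; at most you have to think about <code>volatile</code>, interrupt races, and bus bandwidth.</p>
<p>On Cortex-M7 the model has to change: when the CPU accesses an address it may go to the D-Cache first, whereas DMA normally goes straight to SRAM, AXI SRAM, DTCM (where that DMA master can reach it), or external SDRAM. The D-Cache line size is commonly 32 bytes. When the CPU writes a variable, it may merely mark the corresponding cache line as dirty without writing it back to memory yet. When the CPU reads an address and the cache hits, it may never go to memory to fetch what DMA has just written.</p>
<p>This is the cache coherency problem. Desktop CPUs and high-end SoCs often have hardware coherency: multiple cores, DMA, and peripherals stay coherent through protocols such as ACE and CHI. On many MCUs, however, DMA does not take part in any D-Cache coherency protocol, so the maintenance responsibility falls on software.</p>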
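<p>The model above can be played out on a host PC. The following is a minimal sketch, not Cortex-M7 code: the <code>toy_cache_t</code> type and every function name are made up for illustration, and the "cache" is a single write-back, write-allocate line shadowing one 32-byte block of memory. It only aims to show why a CPU write stays invisible to DMA until a Clean, and why a DMA write stays invisible to the CPU until an Invalidate.</p>
<pre tabindex="0"><code class="language-c" data-lang="c">#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define LINE_SIZE 32u

/* Toy model (hypothetical): one block of memory and one cache line shadowing it. */
typedef struct {
    uint8_t mem[LINE_SIZE];   /* what DMA reads and writes */
    uint8_t line[LINE_SIZE];  /* what the CPU sees on a cache hit */
    int valid;                /* line currently holds a copy of mem */
    int dirty;                /* line has CPU writes that are not in mem yet */
} toy_cache_t;

static uint8_t cpu_read(toy_cache_t *c, unsigned i)
{
    assert(i &lt; LINE_SIZE);
    if (!c-&gt;valid) {                        /* miss: fill the line from memory */
        memcpy(c-&gt;line, c-&gt;mem, LINE_SIZE);
        c-&gt;valid = 1;
        c-&gt;dirty = 0;
    }
    return c-&gt;line[i];                      /* hit: memory is not consulted */
}

static void cpu_write(toy_cache_t *c, unsigned i, uint8_t v)
{
    (void)cpu_read(c, i);                   /* write-allocate */
    c-&gt;line[i] = v;
    c-&gt;dirty = 1;                           /* write-back: memory not updated yet */
}

/* DMA bypasses the cache entirely. */
static void dma_write(toy_cache_t *c, unsigned i, uint8_t v) { c-&gt;mem[i] = v; }
static uint8_t dma_read(const toy_cache_t *c, unsigned i) { return c-&gt;mem[i]; }

/* Clean: write dirty data back to memory, keep the line valid. */
static void cache_clean(toy_cache_t *c)
{
    if (c-&gt;valid &amp;&amp; c-&gt;dirty) {
        memcpy(c-&gt;mem, c-&gt;line, LINE_SIZE);
        c-&gt;dirty = 0;
    }
}

/* Invalidate: drop the line so the next CPU read refills from memory. */
static void cache_invalidate(toy_cache_t *c) { c-&gt;valid = 0; c-&gt;dirty = 0; }
</code></pre>
<p>In this model, <code>dma_read()</code> keeps returning the old byte after <code>cpu_write()</code> until <code>cache_clean()</code> runs, and <code>cpu_read()</code> keeps returning the old byte after <code>dma_write()</code> until <code>cache_invalidate()</code> runs. A real D-Cache has many lines and may evict them at any time, which is what makes the real failures intermittent.</p>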
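<h3 id="三类常见内存的差异">How Three Common Kinds of Memory Differ</h3>
<p>Taking the STM32H7 as an example, developers regularly deal with DTCM RAM, AXI SRAM, SRAM1 / SRAM2 / SRAM3, external SDRAM, and other regions. They differ not only in capacity but also in access path:</p>
<ol>
<li><strong>DTCM RAM</strong>: very fast for the CPU; suits stacks, real-time control variables, and intermediate algorithm state. However, quite a few DMA masters cannot reach DTCM, so placing a DMA buffer there fails outright: the DMA moves nothing, or the peripheral sees no data.</li>
<li><strong>AXI SRAM</strong>: sits on the AXI bus matrix and is reachable by the CPU and by many DMA masters, which makes it the usual choice for large DMA buffers. With the cache enabled the CPU processes it quickly, but coherency maintenance is mandatory.</li>
<li><strong>External SDRAM / PSRAM</strong>: large; suits frame buffers, neural-network intermediate tensors, and file caches. Latency is higher and arbitration is more complex, so it needs the cache even more, and it more readily exposes refresh, alignment, and bandwidth bottlenecks.</li>
</ol>
<p>So the first engineering principle is: do not only ask "is this memory big enough"; also ask "who accesses it, over which bus, does the access go through the cache, and is DMA allowed to reach it".</p>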
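<p>One way to make that question executable is a range check that runs before a buffer address is handed to a DMA controller. The sketch below is an assumption-laden example: the address map contains only the DTCM and AXI SRAM windows used as examples in this article, the "this DMA master reaches AXI SRAM only" policy is hypothetical, and all function names are invented. A real project would take the map and the reachability rules from the chip's reference manual.</p>
<pre tabindex="0"><code class="language-c" data-lang="c">#include &lt;assert.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Example map only; check the reference manual of the actual part. */
#define DTCM_BASE    0x20000000u
#define DTCM_SIZE    (128u * 1024u)
#define AXIRAM_BASE  0x24000000u
#define AXIRAM_SIZE  (512u * 1024u)

/* True when [addr, addr + len) lies entirely inside [base, base + size). */
static bool range_within(uintptr_t addr, size_t len, uintptr_t base, size_t size)
{
    return len &lt;= size &amp;&amp; addr &gt;= base &amp;&amp; (addr - base) &lt;= size - len;
}

/* CPU-only memory in this example: fast, but not a place for DMA buffers. */
static bool in_dtcm(uintptr_t addr, size_t len)
{
    return len &gt; 0 &amp;&amp; range_within(addr, len, DTCM_BASE, DTCM_SIZE);
}

/* Hypothetical policy: the DMA master in question can reach AXI SRAM only. */
static bool dma_can_reach(uintptr_t addr, size_t len)
{
    return len &gt; 0 &amp;&amp; range_within(addr, len, AXIRAM_BASE, AXIRAM_SIZE);
}

/* Hypothetical guard to call before programming a DMA address register. */
static void dma_check_buffer(uintptr_t addr, size_t len)
{
    assert(dma_can_reach(addr, len));
}
</code></pre>
<p>A driver could call <code>dma_check_buffer((uintptr_t)buf, len)</code> in debug builds, so that a buffer accidentally placed in DTCM fails loudly at the call site instead of showing up as a silent transfer that never happens.</p>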
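<h2 id="二mpu-的价值把内存使用约定固化为硬件属性">2. The Value of the MPU: Turning "Memory Usage Conventions" into Hardware Attributes</h2>
<p>The MPU (Memory Protection Unit) is ignored in many bare-metal projects, on the assumption that it only matters when an RTOS needs task isolation and permission protection. In fact, on an MCU with a D-Cache, the most common use of the MPU is to define memory attributes: whether a region is cacheable, bufferable, or shareable, whether code may execute from it, and what the read/write permissions are.</p>
<p>If the MPU is left unconfigured, many libraries enable the cache with the default memory attributes. The defaults may be fine for ordinary code and data, but they are not necessarily right for buffers shared with DMA. A mature high-performance MCU project usually divides memory into several classes:</p>
<ul>
<li>ordinary code and regular data: cacheable, for CPU performance;</li>
<li>DMA descriptors: non-cacheable or strictly hand-maintained, for determinism;</li>
<li>large DMA data buffers: cacheable, with Clean / Invalidate applied by direction, to keep throughput;</li>
<li>peripheral register space: Device type, with no reordering and no caching;</li>
<li>frame buffers: Write-through, Write-back, or Non-cacheable, chosen according to how the work is split among LTDC, DMA2D, and CPU drawing.</li>
</ul>
<p>This division looks like extra work, but it turns "conventions" into hardware configuration applied at system start-up rather than comments scattered through driver code.</p>
<h3 id="一个典型的-mpu-区域规划">A Typical MPU Region Plan</h3>
<p>Below is a simplified plan. It does not correspond to the complete address map of any particular chip, but it is enough to show the idea:</p>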
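<table>
<thead>
<tr>
<th>Region</th>
<th>Purpose</th>
<th>Suggested attributes</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
<td>Flash</td>
<td>Code, read-only tables</td>
<td>Normal, Cacheable, Executable</td>
<td>Faster instruction fetch and table lookup</td>
</tr>
<tr>
<td>DTCM</td>
<td>Stack, control variables, real-time algorithms</td>
<td>Normal, Non-cacheable, or the TCM default</td>
<td>Low-latency CPU access; not used for DMA</td>
</tr>
<tr>
<td>AXI SRAM, general section</td>
<td>Heap, algorithm buffers</td>
<td>Normal, Write-back Cacheable</td>
<td>Higher CPU processing throughput</td>
</tr>
<tr>
<td>AXI SRAM, DMA descriptor section</td>
<td>ETH, SDMMC, USB descriptors</td>
<td>Normal, Non-cacheable</td>
<td>Keeps descriptor state from going out of sync</td>
</tr>
<tr>
<td>AXI SRAM, DMA data section</td>
<td>Network packets, image blocks, audio blocks</td>
<td>Normal, Cacheable + manual maintenance</td>
<td>The cache speeds up bulk data processing</td>
</tr>
<tr>
<td>Peripheral registers</td>
<td>GPIO, DMA, ETH, etc.</td>
<td>Device, Non-cacheable</td>
<td>Prevents caching and inappropriate reordering</td>
</tr>
</tbody>
</table>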
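<p>Note that MPU regions have alignment requirements: on the ARMv7-M MPU used by Cortex-M7, a region's size must be a power of two of at least 32 bytes and its base address must be aligned to that size. The number of regions is also limited. Do not create a separate region for every small array; instead, place shared buffers in common linker sections such as <code>.dma_desc</code> and <code>.dma_buffer</code>, and let the linker script guarantee alignment and boundaries.</p>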
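<p>The alignment rule is easy to check in code before a region is programmed. The helpers below are a small sketch with invented names; they assume the ARMv7-M MPU rules (power-of-two size of at least 32 bytes, base aligned to the size, and a <code>SIZE</code> field in <code>MPU_RASR</code> meaning a region of 2^(SIZE+1) bytes). The 4 GB region size is left out because it does not fit in a <code>uint32_t</code> byte count.</p>
<pre tabindex="0"><code class="language-c" data-lang="c">#include &lt;assert.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

/* ARMv7-M MPU rule: size is a power of two, at least 32 bytes,
 * and the base address is a multiple of the size. */
static bool mpu_region_valid(uint32_t base, uint32_t size)
{
    bool pow2 = (size != 0u) &amp;&amp; ((size &amp; (size - 1u)) == 0u);
    return pow2 &amp;&amp; (size &gt;= 32u) &amp;&amp; ((base &amp; (size - 1u)) == 0u);
}

/* MPU_RASR.SIZE encodes a region of 2^(SIZE + 1) bytes,
 * so SIZE = log2(size) - 1 for a valid size. */
static uint32_t mpu_size_field(uint32_t size)
{
    assert(mpu_region_valid(0u, size));
    uint32_t log2 = 0u;
    while (size &gt; 1u) {
        size &gt;&gt;= 1;
        log2++;
    }
    return log2 - 1u;
}
</code></pre>
<p>For example, a 16 KB window at <code>0x24070000</code> passes the check, while the same window at <code>0x24071000</code> does not, because the base is not a multiple of 16 KB. A 48 KB region is rejected outright and has to be built from several regions or from sub-region disables.</p>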
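<h2 id="三cache-维护动作cleaninvalidate-和-cleaninvalidate-的边界">3. Cache Maintenance Operations: Where Clean, Invalidate, and CleanInvalidate Each Apply</h2>
<p>The cache maintenance APIs have similar names, and many bugs come from calling them at the wrong moment. Three rules cover it:</p>
<ul>
<li><strong>CPU writes, DMA reads</strong>: Clean before starting DMA, so that dirty data the CPU left in the cache is written back to memory;</li>
<li><strong>DMA writes, CPU reads</strong>: Invalidate after DMA completes, so that the CPU drops the stale cache lines and fetches the new data from memory on the next read;</li>
<li><strong>Bidirectional or uncertain state</strong>: use CleanInvalidate with care, and only after confirming that a stale dirty cache line cannot be written back over data that DMA has just written.</li>
</ul>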
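<p>The rules above can be written down as a small lookup, which is handy as a reference when reviewing driver code. This is only a sketch: the enum and function names are invented, and cleaning a receive buffer before DMA starts is a conservative choice assumed here (it only matters if the CPU has written the buffer), not something the rules above require.</p>
<pre tabindex="0"><code class="language-c" data-lang="c">#include &lt;assert.h&gt;
#include &lt;stdbool.h&gt;

typedef enum { DMA_MEM_TO_PERIPH, DMA_PERIPH_TO_MEM } dma_dir_t;
typedef enum { DMA_PHASE_BEFORE_START, DMA_PHASE_AFTER_DONE } dma_phase_t;
typedef enum { CACHE_OP_NONE, CACHE_OP_CLEAN, CACHE_OP_INVALIDATE } cache_op_t;

/* Which maintenance operation a buffer needs at a given point of a transfer. */
static cache_op_t cache_op_for(dma_dir_t dir, dma_phase_t phase, bool cacheable)
{
    assert(dir == DMA_MEM_TO_PERIPH || dir == DMA_PERIPH_TO_MEM);

    if (!cacheable) {
        return CACHE_OP_NONE;            /* non-cacheable memory needs no maintenance */
    }
    if (phase == DMA_PHASE_BEFORE_START) {
        /* TX: push the CPU's data out to memory so DMA reads it.
         * RX: conservative assumption, flush any dirty lines so they
         *     cannot be written back over the data DMA is about to write. */
        return CACHE_OP_CLEAN;
    }
    /* After completion only the receive direction needs work:
     * drop stale lines so the CPU reads what DMA wrote. */
    return (dir == DMA_PERIPH_TO_MEM) ? CACHE_OP_INVALIDATE : CACHE_OP_NONE;
}
</code></pre>
<p>Keeping the decision in one function means a driver asks "what does this buffer need now" instead of hard-coding a Clean or an Invalidate at each call site.</p>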
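<p>Take transmitting a network packet: the CPU builds the Ethernet frame, then the ETH DMA reads it from memory and sends it. If the frame contents are still sitting in the D-Cache, the ETH DMA reads stale memory. The correct action is to Clean the data buffer before handing the descriptor to DMA.</p>
<p>Take receiving a network packet: the ETH DMA writes the data to memory, then an interrupt notifies the CPU. If the CPU has read this buffer before, the corresponding cache lines may still be in the D-Cache. The correct action is to Invalidate the buffer before the CPU parses the packet.</p>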
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#define CACHE_LINE_SIZE 32U
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kr">inline</span> <span class="kt">uintptr_t</span> <span class="nf">align_down</span><span class="p">(</span><span class="kt">uintptr_t</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">uintptr_t</span> <span class="n">align</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">addr</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="n">align</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kr">inline</span> <span class="kt">uintptr_t</span> <span class="nf">align_up</span><span class="p">(</span><span class="kt">uintptr_t</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">uintptr_t</span> <span class="n">align</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="p">(</span><span class="n">addr</span> <span class="o">+</span> <span class="n">align</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">)</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="n">align</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_clean_range</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">start</span> <span class="o">=</span> <span class="nf">align_down</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span><span class="p">,</span> <span class="n">CACHE_LINE_SIZE</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">end</span>   <span class="o">=</span> <span class="nf">align_up</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span> <span class="o">+</span> <span class="n">len</span><span class="p">,</span> <span class="n">CACHE_LINE_SIZE</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_CleanDCache_by_Addr</span><span class="p">((</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">start</span><span class="p">,</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DSB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_invalidate_range</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">start</span> <span class="o">=</span> <span class="nf">align_down</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span><span class="p">,</span> <span class="n">CACHE_LINE_SIZE</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">end</span>   <span class="o">=</span> <span class="nf">align_up</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span> <span class="o">+</span> <span class="n">len</span><span class="p">,</span> <span class="n">CACHE_LINE_SIZE</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_InvalidateDCache_by_Addr</span><span class="p">((</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">start</span><span class="p">,</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DSB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
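</span></span></code></pre></div><p>The key detail here is that the address and length are aligned to the cache line. Many HAL examples pass the raw address and length; that appears to work but is risky on unaligned buffers, because the hardware maintains whole cache lines, not C objects. If a 20-byte DMA buffer shares a 32-byte cache line with a neighboring variable, an Invalidate may also discard the CPU's cached modification of that neighbor, and a Clean may write back data that was not meant to be committed.</p>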
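<p>To see when a buffer is exposed to this problem, it helps to count the cache lines it touches and to check whether it owns them completely. The helpers below are a host-testable sketch with invented names, assuming the 32-byte line size used throughout this article.</p>
<pre tabindex="0"><code class="language-c" data-lang="c">#include &lt;assert.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define CACHE_LINE_SIZE 32u

/* Number of cache lines that [addr, addr + len) overlaps. */
static size_t cache_lines_touched(uintptr_t addr, size_t len)
{
    if (len == 0u) {
        return 0u;
    }
    uintptr_t first = addr / CACHE_LINE_SIZE;
    uintptr_t last  = (addr + len - 1u) / CACHE_LINE_SIZE;
    return (size_t)(last - first + 1u);
}

/* True when the buffer starts and ends on line boundaries,
 * so no other object can share any of its cache lines. */
static bool owns_whole_lines(uintptr_t addr, size_t len)
{
    return len != 0u
        &amp;&amp; (addr % CACHE_LINE_SIZE) == 0u
        &amp;&amp; (len % CACHE_LINE_SIZE) == 0u;
}

/* Hypothetical debug guard for DMA buffers in cacheable memory. */
static void dma_buffer_assert_exclusive(uintptr_t addr, size_t len)
{
    assert(owns_whole_lines(addr, len));
}
</code></pre>
<p>A 20-byte buffer at a line-aligned address touches one line but does not own it: the remaining 12 bytes of that line belong to whatever the linker placed next. The same buffer starting 24 bytes into a line straddles two lines and can disturb neighbors on both sides.</p>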
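<h2 id="四链接脚本不要让-dma-buffer-和普通变量混住">4. Linker Script: Do Not Let DMA Buffers Share Space with Ordinary Variables</h2>
<p>Relying on <code>__attribute__((aligned(32)))</code> alone solves part of the problem, but not systematically. Alignment fixes the start address only; it does not stop the linker from placing another variable in the tail of the buffer's last cache line, and it does not guarantee that the whole object lands in SRAM that DMA can reach. A more robust approach is to create dedicated linker sections for DMA descriptors and DMA data.</p>
<p>With a GCC linker script, for example, two areas can be carved out of AXI SRAM: a non-cacheable descriptor area and a cacheable, hand-maintained data area. The actual addresses must be adjusted to the chip's manual; only the structure is shown here.</p>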
<pre tabindex="0"><code class="language-ld" data-lang="ld">MEMORY
{
  FLASH   (rx)  : ORIGIN = 0x08000000, LENGTH = 2048K
  DTCMRAM (xrw) : ORIGIN = 0x20000000, LENGTH = 128K
  AXIRAM  (xrw) : ORIGIN = 0x24000000, LENGTH = 512K
}

SECTIONS
{
  .dma_desc (NOLOAD) : ALIGN(32)
  {
    __dma_desc_start__ = .;
    *(.dma_desc*)
    . = ALIGN(32);
    __dma_desc_end__ = .;
  } &gt; AXIRAM

  .dma_buffer (NOLOAD) : ALIGN(32)
  {
    __dma_buffer_start__ = .;
    *(.dma_buffer*)
    . = ALIGN(32);
    __dma_buffer_end__ = .;
  } &gt; AXIRAM
}
</code></pre><p>In C code the declarations can look like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#define DMA_ALIGN __attribute__((aligned(32)))
</span></span></span><span class="line"><span class="cl"><span class="cp">#define DMA_DESC  __attribute__((section(&#34;.dma_desc&#34;), aligned(32)))
</span></span></span><span class="line"><span class="cl"><span class="cp">#define DMA_BUF   __attribute__((section(&#34;.dma_buffer&#34;), aligned(32)))
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="n">DMA_DESC</span> <span class="n">ETH_DMADescTypeDef</span> <span class="n">eth_rx_desc</span><span class="p">[</span><span class="n">ETH_RX_DESC_CNT</span><span class="p">];</span>
</span></span><span class="line"><span class="cl"><span class="n">DMA_DESC</span> <span class="n">ETH_DMADescTypeDef</span> <span class="n">eth_tx_desc</span><span class="p">[</span><span class="n">ETH_TX_DESC_CNT</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">DMA_BUF</span> <span class="kt">uint8_t</span> <span class="n">eth_rx_pool</span><span class="p">[</span><span class="n">ETH_RX_DESC_CNT</span><span class="p">][</span><span class="mi">1536</span><span class="p">];</span>
</span></span><span class="line"><span class="cl"><span class="n">DMA_BUF</span> <span class="kt">uint8_t</span> <span class="n">eth_tx_pool</span><span class="p">[</span><span class="n">ETH_TX_DESC_CNT</span><span class="p">][</span><span class="mi">1536</span><span class="p">];</span>
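</span></span></code></pre></div><p>This has several benefits. First, when reviewing the map file you can see at a glance where the DMA resources live. Second, the MPU can mark the whole <code>.dma_desc</code> section Non-cacheable, provided the section is placed and sized to satisfy the MPU alignment rules. Third, data buffers at least no longer share a cache line with ordinary global variables. Fourth, migrating to another chip or board costs less.</p>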
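<p>Alignment of the start address still leaves the length to take care of. The pool elements above are 1536 bytes, which happens to be a multiple of 32, but other sizes need padding. The following is a small sketch of one way to enforce that at compile time; the macro name <code>DMA_ROUND_UP</code>, the 1500-byte payload size, and the <code>padded_block_t</code> type are illustrative assumptions, not part of any vendor library.</p>
<pre tabindex="0"><code class="language-c" data-lang="c">#include &lt;assert.h&gt;   /* static_assert (C11) */
#include &lt;stdint.h&gt;

#define CACHE_LINE_SIZE 32u

/* Round a byte count up to a whole number of cache lines. */
#define DMA_ROUND_UP(n) \
    (((n) + CACHE_LINE_SIZE - 1u) &amp; ~(CACHE_LINE_SIZE - 1u))

/* Hypothetical payload size that is not a multiple of 32. */
#define PAYLOAD_BYTES 1500u

/* The array is padded so the object ends on a cache-line boundary. */
typedef struct {
    uint8_t data[DMA_ROUND_UP(PAYLOAD_BYTES)];
} padded_block_t;

/* Fail the build if a buffer could share its last cache line. */
static_assert(sizeof(padded_block_t) % CACHE_LINE_SIZE == 0u,
              "DMA block must cover whole cache lines");
static_assert(DMA_ROUND_UP(1536u) == 1536u,
              "1536 is already a multiple of the line size");
</code></pre>
<p>Combined with the <code>aligned(32)</code> attribute from the macros above, a type like this starts and ends on cache-line boundaries, so a Clean or Invalidate over it cannot touch a neighbor.</p>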
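<h2 id="五mpu-初始化顺序先配置属性再打开-cache">5. MPU Initialization Order: Configure Attributes First, Then Enable the Cache</h2>
<p>The initialization order of the MPU and the cache matters. The usual recommendation is to do it early in start-up: disable the MPU, configure each region, enable the MPU, and only then enable the I-Cache / D-Cache. If a region's attributes are changed after the system is already running, the affected cache contents must first be cleaned and invalidated with great care; otherwise the system is left in a state that is hard to reproduce.</p>
<p>A simplified initialization flow looks like this:</p>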
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">system_memory_attr_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_Region_InitTypeDef</span> <span class="n">MPU_InitStruct</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_Disable</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* AXI SRAM: default write-back cacheable */</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Enable</span>           <span class="o">=</span> <span class="n">MPU_REGION_ENABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Number</span>           <span class="o">=</span> <span class="n">MPU_REGION_NUMBER0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">BaseAddress</span>      <span class="o">=</span> <span class="mh">0x24000000</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Size</span>             <span class="o">=</span> <span class="n">MPU_REGION_SIZE_512KB</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">AccessPermission</span> <span class="o">=</span> <span class="n">MPU_REGION_FULL_ACCESS</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsBufferable</span>     <span class="o">=</span> <span class="n">MPU_ACCESS_BUFFERABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsCacheable</span>      <span class="o">=</span> <span class="n">MPU_ACCESS_CACHEABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsShareable</span>      <span class="o">=</span> <span class="n">MPU_ACCESS_NOT_SHAREABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">TypeExtField</span>     <span class="o">=</span> <span class="n">MPU_TEX_LEVEL1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">DisableExec</span>      <span class="o">=</span> <span class="n">MPU_INSTRUCTION_ACCESS_DISABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">SubRegionDisable</span> <span class="o">=</span> <span class="mh">0x00</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_ConfigRegion</span><span class="p">(</span><span class="o">&amp;</span><span class="n">MPU_InitStruct</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* DMA descriptor window: non-cacheable */</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Number</span>           <span class="o">=</span> <span class="n">MPU_REGION_NUMBER1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">BaseAddress</span>      <span class="o">=</span> <span class="mh">0x24070000</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Size</span>             <span class="o">=</span> <span class="n">MPU_REGION_SIZE_16KB</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsBufferable</span>     <span class="o">=</span> <span class="n">MPU_ACCESS_NOT_BUFFERABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsCacheable</span>      <span class="o">=</span> <span class="n">MPU_ACCESS_NOT_CACHEABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsShareable</span>      <span class="o">=</span> <span class="n">MPU_ACCESS_SHAREABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_ConfigRegion</span><span class="p">(</span><span class="o">&amp;</span><span class="n">MPU_InitStruct</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_Enable</span><span class="p">(</span><span class="n">MPU_PRIVILEGED_DEFAULT</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_EnableICache</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_EnableDCache</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
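</span></span></code></pre></div><p>This code cannot be copied verbatim into every project, because HAL macros, addresses, region sizes, and default attributes differ between chips. Note also that the non-cacheable window (<code>0x24070000</code>, 16 KB in this example) only helps if the linker script actually places <code>.dma_desc</code> inside it. What should be copied is the principle: peripheral registers keep the Device attribute; ordinary memory is Cacheable wherever possible; the descriptor area is preferably Non-cacheable; large data areas that need throughput are Cacheable plus maintenance functions; and every region's base address and size satisfy the MPU rules.</p>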
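<p>The <code>TypeExtField</code>, <code>IsCacheable</code>, and <code>IsBufferable</code> fields map onto the TEX, C, and B bits of the ARMv7-M MPU, and it is the combination that selects the memory type. As a reading aid, the sketch below decodes the common combinations; the enum and function names are invented here, and encodings outside the listed ones are simply reported as <code>MEM_OTHER</code>.</p>
<pre tabindex="0"><code class="language-c" data-lang="c">#include &lt;assert.h&gt;

typedef enum {
    MEM_STRONGLY_ORDERED,     /* TEX=000 C=0 B=0 */
    MEM_DEVICE,               /* TEX=000 C=0 B=1 */
    MEM_NORMAL_WT,            /* TEX=000 C=1 B=0: write-through, no write allocate */
    MEM_NORMAL_WB,            /* TEX=000 C=1 B=1: write-back, no write allocate */
    MEM_NORMAL_NONCACHEABLE,  /* TEX=001 C=0 B=0 */
    MEM_NORMAL_WBWA,          /* TEX=001 C=1 B=1: write-back, read and write allocate */
    MEM_OTHER                 /* reserved or not covered by this sketch */
} mem_type_t;

/* Decode the ARMv7-M MPU TEX/C/B attribute bits into a memory type. */
static mem_type_t mpu_decode(unsigned tex, unsigned c, unsigned b)
{
    assert(tex &lt;= 7u &amp;&amp; c &lt;= 1u &amp;&amp; b &lt;= 1u);

    if (tex == 0u) {
        if (c == 0u) {
            return b ? MEM_DEVICE : MEM_STRONGLY_ORDERED;
        }
        return b ? MEM_NORMAL_WB : MEM_NORMAL_WT;
    }
    if (tex == 1u) {
        if (c == 0u &amp;&amp; b == 0u) {
            return MEM_NORMAL_NONCACHEABLE;
        }
        if (c == 1u &amp;&amp; b == 1u) {
            return MEM_NORMAL_WBWA;
        }
    }
    return MEM_OTHER;
}
</code></pre>
<p>Read against the initialization code above: region 0 uses TEX level 1 with C=1 and B=1, which decodes to write-back with read and write allocate, and region 1 keeps TEX level 1 with C=0 and B=0, which decodes to Normal Non-cacheable.</p>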
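<h2 id="六发送方向案例cpu-生产数据dma-消费数据">6. Transmit Case: The CPU Produces Data, DMA Consumes It</h2>
<p>The transmit direction is the easiest to explain. Suppose we send a block of samples over SPI DMA, or a network frame over ETH DMA. The data flow is: the CPU writes the buffer, DMA reads the buffer, the peripheral transmits.</p>
<p>The correct flow is:</p>
<ol>
<li>the CPU fills the buffer;</li>
<li>if the buffer is in a cacheable region, perform a Clean;</li>
<li>set the DMA source address, length, and direction;</li>
<li>issue a memory barrier where needed;</li>
<li>start DMA;</li>
<li>after DMA completes, release the buffer or move on to the next round.</li>
</ol>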
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">eth_send_frame</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">uint16_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">len</span> <span class="o">&gt;</span> <span class="mi">1536</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* The CPU has already written the Ethernet frame into buf; write it back to memory before starting DMA. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_clean_range</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* Hypothetical interface: hand buf to a DMA descriptor. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">eth_tx_desc_prepare</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* Barrier: order the descriptor writes before the register write that starts DMA. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DMB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="nf">eth_tx_kick_dma</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
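</span></span></code></pre></div><p>One point is easy to overlook: if the descriptors themselves live in a cacheable region, then besides cleaning the data buffer you must also clean the descriptors. Many Ethernet drivers make the descriptors Non-cacheable, so that DMA sees a descriptor update as soon as the CPU makes it and the driver logic stays simpler. The price is slightly slower CPU access to the descriptors, but descriptors are small and the overhead is usually acceptable.</p>
<p>The same reasoning applies to SDMMC card writes, QSPI DMA flash writes, and SAI / I2S audio playback. Whenever the direction is "CPU writes first, DMA reads afterwards", the core action is Clean.</p>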
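<h2 id="七接收方向案例dma-生产数据cpu-消费数据">7. Receive Case: DMA Produces Data, the CPU Consumes It</h2>
<p>The receive direction is more error-prone, because the CPU may have touched the buffer before DMA finished without anyone intending it. Zeroing during network-stack initialization, debug printing, and prefetching by the protocol stack all bring the corresponding cache lines into the D-Cache. If no Invalidate follows DMA completion, the CPU may still parse the old contents.</p>
<p>The correct flow is:</p>
<ol>
<li>prepare an empty buffer;</li>
<li>if the CPU has ever written the buffer, Clean or CleanInvalidate it as appropriate before starting DMA;</li>
<li>start DMA and let the peripheral write to memory;</li>
<li>wait for the completion interrupt or poll the completion flag;</li>
<li>Invalidate before the CPU reads;</li>
<li>parse the buffer.</li>
</ol>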
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">camera_capture_one</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">frame</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="cm">/* If the CPU zeroed frame earlier and the region is cacheable, clean it to memory before starting. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_clean_range</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">dcmi_start_dma</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nf">wait_frame_done</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">dcmi_stop_dma</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* DMA has finished writing; discard stale cache lines before the CPU reads. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_invalidate_range</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
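</span></span></code></pre></div><p>Some references recommend "Invalidate once before receiving and once more after receiving". This is not without merit: the Invalidate before reception prevents a dirty cache line from being written back on a later eviction and overwriting the new DMA data, and the Invalidate after reception ensures that the CPU sees the DMA result. In practice, though, it is better to keep the CPU away from the buffer while DMA is running and to make ownership explicit with a buffer state machine. If ownership is muddled, adding more maintenance calls only lowers the probability of failure; it does not cure it.</p>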
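<p>The eviction hazard mentioned above is easy to reproduce in a host-side toy model. The sketch below is illustrative only: <code>rx_line_t</code> and its functions are invented names, and the model has a single write-back cache line in front of a 32-byte receive buffer. It shows a dirty line left over from zeroing the buffer being written back over freshly received data, and how cleaning before the transfer avoids it.</p>
<pre tabindex="0"><code class="language-c" data-lang="c">#include &lt;assert.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define LINE_SIZE 32u

/* Hypothetical one-line model of a write-back cache in front of an RX buffer. */
typedef struct {
    uint8_t mem[LINE_SIZE];   /* memory as DMA sees it */
    uint8_t line[LINE_SIZE];  /* cached copy as the CPU sees it */
    bool valid;
    bool dirty;
} rx_line_t;

/* CPU zeroes the buffer: with write-back caching only the cache line changes. */
static void cpu_zero(rx_line_t *m)
{
    assert(m != NULL);
    memset(m-&gt;line, 0, LINE_SIZE);
    m-&gt;valid = true;
    m-&gt;dirty = true;
}

/* DMA receive: writes memory directly and knows nothing about the cache. */
static void dma_fill(rx_line_t *m, uint8_t v) { memset(m-&gt;mem, v, LINE_SIZE); }

/* Clean: write a dirty line back to memory but keep it valid. */
static void clean(rx_line_t *m)
{
    if (m-&gt;valid &amp;&amp; m-&gt;dirty) {
        memcpy(m-&gt;mem, m-&gt;line, LINE_SIZE);
        m-&gt;dirty = false;
    }
}

/* Eviction can happen at any time; a dirty line is written back first. */
static void evict(rx_line_t *m)
{
    clean(m);
    m-&gt;valid = false;
}
</code></pre>
<p>With the sequence <code>cpu_zero</code>, <code>dma_fill</code>, <code>evict</code>, the received bytes in <code>mem</code> are replaced by zeros. With <code>cpu_zero</code>, <code>clean</code>, <code>dma_fill</code>, <code>evict</code>, the line is no longer dirty when it is evicted and the received data survives.</p>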
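<h2 id="八双缓冲与环形队列用所有权模型替代到处加维护函数">8. Double Buffering and Ring Queues: Replace "Maintenance Calls Everywhere" with an Ownership Model</h2>
<p>High-throughput peripherals rarely use a single buffer. Cameras have frame buffers, audio has ping-pong buffers, networking has RX / TX descriptor rings, and SD cards have block caches. What matters most here is not calling more cache APIs but establishing a clear buffer ownership model.</p>
<p>Each buffer's state can be defined as:</p>
<ul>
<li><code>FREE</code>: the CPU may fill it or hand it to a peripheral;</li>
<li><code>DMA_OWNED</code>: DMA is reading or writing it; the CPU must not access it;</li>
<li><code>CPU_OWNED</code>: DMA has finished; the CPU may parse or modify it;</li>
<li><code>QUEUED</code>: it has been placed on a protocol-stack or application queue and is waiting to be consumed.</li>
</ul>
<p>In the transmit direction, Clean is performed at the moment the buffer goes from <code>CPU_OWNED</code> to <code>DMA_OWNED</code>; in the receive direction, Invalidate is performed at the moment it goes from <code>DMA_OWNED</code> to <code>CPU_OWNED</code>. The maintenance actions are tied to state transitions instead of being scattered through business logic.</p>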
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">enum</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">BUF_FREE</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">BUF_CPU_OWNED</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">BUF_DMA_OWNED</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">BUF_QUEUED</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="kt">buf_state_t</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">addr</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">volatile</span> <span class="kt">buf_state_t</span> <span class="n">state</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="kt">dma_buf_t</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">rx_dma_complete_isr</span><span class="p">(</span><span class="kt">dma_buf_t</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">actual_len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">=</span> <span class="n">actual_len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_invalidate_range</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">addr</span><span class="p">,</span> <span class="n">actual_len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="n">b</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">=</span> <span class="n">BUF_CPU_OWNED</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">tx_submit</span><span class="p">(</span><span class="kt">dma_buf_t</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_clean_range</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">addr</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="n">b</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">=</span> <span class="n">BUF_DMA_OWNED</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_submit_to_hw</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">addr</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
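</span></span></code></pre></div><p>This pattern has a further benefit: if you later make some buffers Non-cacheable, you only need to change the implementation of <code>dma_clean_range()</code> / <code>dma_invalidate_range()</code> or the memory attributes, rather than deleting maintenance calls one by one throughout the application layer.</p>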
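<h2 id="九什么时候应该把-dma-区域设为-non-cacheable">9. When should a DMA region be Non-cacheable?</h2>
<p>If manual cache maintenance is so error-prone, why not simply make every DMA buffer Non-cacheable? You can, but whether you should depends on the data size and access pattern.</p>
<p>Three kinds of objects are usually good candidates for Non-cacheable memory. The first is small objects such as descriptors, status words, and shadow copies of doorbell registers: the CPU and DMA access them alternately and frequently, the data volume is small, and determinism matters more than cache hits. The second is small packet buffers for low-speed peripherals, for example a UART DMA receiving commands of a few dozen bytes, where caching brings little benefit. The third is the debugging phase: to rule out coherency problems quickly, you can temporarily make the shared region Non-cacheable and confirm whether the cache is the cause.</p>
<p>Objects that should not be Non-cacheable are just as typical: full camera frames, LCD framebuffers, neural-network input and output tensors, high-throughput network data pools, and file-system block caches. If the CPU scans, copies, color-converts, checksums, or parses this data heavily, turning the cache off entirely often drops performance to an unacceptable level. For example, an 800×480 RGB565 framebuffer is 768,000 bytes (750 KiB); if every CPU access while drawing the UI goes straight to external SDRAM, both refresh rate and responsiveness suffer badly.</p>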
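<p>To make the framebuffer figure concrete, here is a minimal sketch of the arithmetic. The helper names <code>framebuffer_bytes()</code> and <code>cache_lines_for()</code> are illustrative and not taken from any library; the 32-byte line size is that of the Cortex-M7 D-Cache.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define CACHE_LINE_BYTES 32U  /* Cortex-M7 D-Cache line size */

/* Size in bytes of a packed framebuffer (hypothetical helper). */
static size_t framebuffer_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Number of cache lines needed to cover `bytes`, rounding up. */
static size_t cache_lines_for(size_t bytes)
{
    return (bytes + CACHE_LINE_BYTES - 1U) / CACHE_LINE_BYTES;
}
</code></pre></div>
<p>For 800×480 RGB565 this gives 768,000 bytes, or 24,000 cache lines, which is one reason whole-frame maintenance is far from free.</p>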
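<p>A more practical strategy is therefore: <strong>descriptors Non-cacheable, bulk data Cacheable; maintenance operations wrapped in one place; every buffer 32-byte aligned; and the CPU and DMA never own the same buffer at the same time.</strong></p>
<h2 id="十性能调优一致性正确只是第一步">10. Performance tuning: correct coherency is only the first step</h2>
<p>Many projects fix their cache bugs only to find that performance is still disappointing. Cache coherency settles whether the data is correct; throughput also depends on bus arbitration, memory banks, burst length, and access pattern.</p>
<h3 id="1-减少无意义的整帧-clean--invalidate">1. Avoid pointless whole-frame Clean / Invalidate</h3>
<p>Cache maintenance has a cost of its own. Invalidating an entire framebuffer of several hundred KB takes noticeable time. If the DMA actually wrote only an ROI, maintain only that range; if a network packet is actually 300 bytes long, do not maintain the whole 1536-byte pool slot.</p>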
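<p>As a rough illustration of the saving, the sketch below computes how many bytes a by-address maintenance call really touches once the range is rounded out to whole cache lines. <code>maintained_bytes()</code> is a hypothetical helper written for this example, and the addresses used with it are arbitrary.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define CACHE_LINE_BYTES 32U

/* Bytes actually covered when [addr, addr + len) is rounded out to
   whole cache lines, as by-address maintenance does. */
static size_t maintained_bytes(uintptr_t addr, size_t len)
{
    if (len == 0U) return 0U;
    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE_BYTES - 1U);
    uintptr_t start = addr &amp; mask;
    uintptr_t end   = (addr + len + CACHE_LINE_BYTES - 1U) &amp; mask;
    return (size_t)(end - start);
}
</code></pre></div>
<p>A 300-byte packet in a line-aligned slot costs 320 bytes (10 lines) of maintenance instead of 1536 bytes (48 lines). The third case in the test, a 32-byte range that starts mid-line, shows why unaligned buffers cost an extra line.</p>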
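<h3 id="2-避免-cpu-在-dma-期间扫描同一区域">2. Do not let the CPU scan a region while DMA is using it</h3>
<p>Some code reads the first few bytes of a buffer while the DMA is still receiving into it, to show progress or to help debugging. This complicates the cache state and can also cause bus contention. A better approach is to report progress through a separate status variable or the DMA interrupt flags, and not to peek at a data region that the DMA currently owns.</p>
<h3 id="3-对齐不仅是-cache-line也是总线突发">3. Alignment is about bus bursts as well as cache lines</h3>
<p>32-byte alignment solves the cache-line problem, but some DMA engines and buses prefer 64-byte or 128-byte boundaries. For large contiguous data such as images and audio, try to make the row stride, block size, and ring-buffer node size line up with the burst length the hardware recommends (ideally as whole multiples), to reduce bank crossings and unaligned transfers.</p>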
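<p>One way to apply this is to pad each image row to the burst boundary. The following is a minimal sketch; <code>align_up()</code> and <code>padded_stride()</code> are hypothetical helpers, and the burst size is a parameter because the right value depends on the DMA engine and bus in question.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;

/* Round `value` up to the next multiple of `align` (must be a power of two). */
static size_t align_up(size_t value, size_t align)
{
    return (value + align - 1U) &amp; ~(align - 1U);
}

/* Row stride in bytes, padded to the burst boundary (hypothetical helper). */
static size_t padded_stride(size_t width_px, size_t bytes_per_pixel, size_t burst_bytes)
{
    return align_up(width_px * bytes_per_pixel, burst_bytes);
}
</code></pre></div>
<p>An 800-pixel RGB565 row is 1600 bytes and already a multiple of 64, while an 854-pixel row (1708 bytes) would be padded to 1728. If you pad rows this way, the peripheral's pitch or line-offset setting has to use the padded stride as well.</p>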
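<h3 id="4-用-dwt-或-etm-做真实测量">4. Measure for real with DWT or ETM</h3>
<p>Do not judge an optimization by feel alone. Cortex-M3/M4/M7 cores typically provide the DWT cycle counter, which makes it easy to measure how long Clean, Invalidate, memcpy, and protocol parsing take.</p>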
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="nf">dwt_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">CoreDebug</span><span class="o">-&gt;</span><span class="n">DEMCR</span> <span class="o">|=</span> <span class="n">CoreDebug_DEMCR_TRCENA_Msk</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">DWT</span><span class="o">-&gt;</span><span class="n">CYCCNT</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">DWT</span><span class="o">-&gt;</span><span class="n">CTRL</span> <span class="o">|=</span> <span class="n">DWT_CTRL_CYCCNTENA_Msk</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">uint32_t</span> <span class="nf">measure_invalidate_cycles</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">start</span> <span class="o">=</span> <span class="n">DWT</span><span class="o">-&gt;</span><span class="n">CYCCNT</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_invalidate_range</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">DWT</span><span class="o">-&gt;</span><span class="n">CYCCNT</span> <span class="o">-</span> <span class="n">start</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
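</span></span></code></pre></div><p>Printing the measurements as a table is usually far more effective than guessing. You will see that maintaining 64 bytes, 1500 bytes, 64 KB, and a whole image frame cost very different amounts, and that some global maintenance calls added “just to be safe” are in fact very expensive.</p>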
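<p>When tabulating results, two small helpers are handy: one that makes the wrap-around behavior of the 32-bit counter explicit, and one that converts cycles to time. This is a sketch with illustrative names; the core clock is passed in because it depends on your clock tree.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stdint.h&gt;

/* Elapsed cycles between two CYCCNT samples. Unsigned subtraction stays
   correct across a single wrap of the 32-bit counter. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert cycles to whole microseconds for a core clock given in Hz. */
static uint32_t cycles_to_us(uint32_t cycles, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000ULL) / core_hz);
}
</code></pre></div>
<p>At an assumed 480 MHz core clock, 480,000 cycles correspond to 1000 µs. Because the counter wraps after 2<sup>32</sup> cycles, a single measurement must stay shorter than that (roughly 8.9 s at 480 MHz).</p>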
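<h2 id="十一常见故障排查清单">11. Troubleshooting checklist for common failures</h2>
<p>When you hit a DMA + cache problem, working through the checks below in order is much more efficient.</p>
<ol>
<li><strong>Confirm the DMA can reach the memory region</strong>: many DMA controllers cannot access DTCM. Check the bus matrix and the reference manual first; do not assume access just because the address is in SRAM.</li>
<li><strong>Confirm buffer address and length alignment</strong>: align the address to at least 32 bytes, round the length up to a multiple of 32 bytes, and preferably align the linker section boundaries too.</li>
<li><strong>Confirm the maintenance operation matches the direction</strong>: use Clean when the CPU writes and the DMA reads; use Invalidate when the DMA writes and the CPU reads; do not forget the descriptors.</li>
<li><strong>Confirm the timing of maintenance</strong>: Clean must happen before the DMA starts; Invalidate must happen after the DMA completes and before the CPU reads. If the CPU wrote to a receive buffer earlier, also Clean or Invalidate it before starting the DMA, so that a later eviction of dirty lines cannot overwrite what the DMA wrote.</li>
<li><strong>Confirm there is no concurrent access</strong>: while the DMA owns a buffer, the CPU must not read or write the same range.</li>
<li><strong>Confirm the MPU attributes are what you expect</strong>: read back the MPU configuration or print the region table in the boot log; do not simply trust CubeMX or the default startup file.</li>
<li><strong>Confirm the barriers survive at your optimization level</strong>: add <code>__DMB()</code> / <code>__DSB()</code> at critical state transitions, especially descriptor handoff and interrupt handling paths.</li>
<li><strong>Temporarily disable the D-Cache as an A/B test</strong>: if the problem disappears, you can largely narrow it down to coherency maintenance or MPU attributes.</li>
<li><strong>Check the map file</strong>: confirm the buffer was not linked into the wrong region, and that a typo has not silently disabled the section attribute.</li>
<li><strong>Check for hidden accesses in library code</strong>: protocol stacks, file systems, and graphics libraries may read or write a buffer earlier than you expect, so extend the ownership rules to the library interface boundary.</li>
</ol>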
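<h2 id="十二一个更完整的工程封装思路">12. A more complete engineering wrapper</h2>
<p>In a real project, application code should not call <code>SCB_CleanDCache_by_Addr()</code> directly. Instead, build a very thin <code>dma_mem</code> module that handles alignment, attributes, statistics, and debug assertions in one place.</p>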
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cm">/* dma_mem.h */</span>
</span></span><span class="line"><span class="cl"><span class="cp">#pragma once
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stddef.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdint.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="cp">#define DMA_CACHE_LINE 32U
</span></span></span><span class="line"><span class="cl"><span class="cp">#define DMA_SECTION __attribute__((section(&#34;.dma_buffer&#34;), aligned(DMA_CACHE_LINE)))
</span></span></span><span class="line"><span class="cl"><span class="cp">#define DMA_DESC_SECTION __attribute__((section(&#34;.dma_desc&#34;), aligned(DMA_CACHE_LINE)))
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_init</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_clean</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_invalidate</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_check_range</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">);</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cm">/* dma_mem.c */</span>
</span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;dma_mem.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;stm32h7xx.h&#34;</span><span class="cp"> </span><span class="cm">/* device header (STM32H7 assumed); it includes core_cm7.h and the compiler intrinsics */</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">uintptr_t</span> <span class="nf">down</span><span class="p">(</span><span class="kt">uintptr_t</span> <span class="n">v</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">v</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">DMA_CACHE_LINE</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">uintptr_t</span> <span class="nf">up</span><span class="p">(</span><span class="kt">uintptr_t</span> <span class="n">v</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="p">(</span><span class="n">v</span> <span class="o">+</span> <span class="n">DMA_CACHE_LINE</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">)</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">DMA_CACHE_LINE</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_clean</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">s</span> <span class="o">=</span> <span class="nf">down</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">e</span> <span class="o">=</span> <span class="nf">up</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span> <span class="o">+</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_CleanDCache_by_Addr</span><span class="p">((</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">,</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">e</span> <span class="o">-</span> <span class="n">s</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DSB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_invalidate</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">s</span> <span class="o">=</span> <span class="nf">down</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">e</span> <span class="o">=</span> <span class="nf">up</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span> <span class="o">+</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_InvalidateDCache_by_Addr</span><span class="p">((</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">,</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">e</span> <span class="o">-</span> <span class="n">s</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DSB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
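</span></span></code></pre></div><p>In Debug builds, <code>dma_mem_check_range()</code> can check that the address falls inside a memory segment the DMA is allowed to access, that the length does not run past the buffer boundary, and that the address is 32-byte aligned. The alignment check matters most for Invalidate: the helpers above round the range outward to whole cache lines, so an unaligned buffer would also discard dirty data belonging to neighboring variables. With these checks, problems surface during development instead of turning into sporadic crashes at the customer site.</p>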
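<p>The header only declares <code>dma_mem_check_range()</code>, so the following is one possible sketch of the underlying check, not the author's implementation. It is written as a predicate, <code>dma_mem_range_ok()</code>, that a Debug build could wrap in an assertion. The <code>dma_region_t</code> type and the region table are assumptions for this example; real values would come from your linker script. Requiring the length to be a whole number of lines is also a choice made here, in line with the checklist above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define DMA_CACHE_LINE 32U

/* One DMA-capable memory window (hypothetical; fill from the linker script). */
typedef struct {
    uintptr_t base;
    size_t    size;
} dma_region_t;

/* True if [addr, addr + len) is line-aligned, a whole number of lines long,
   and lies entirely inside one of the allowed regions. */
static bool dma_mem_range_ok(uintptr_t addr, size_t len,
                             const dma_region_t *regions, size_t count)
{
    if (len == 0U) return false;
    if ((addr % DMA_CACHE_LINE) != 0U || (len % DMA_CACHE_LINE) != 0U) return false;

    for (size_t i = 0; i &lt; count; ++i) {
        const dma_region_t *r = &amp;regions[i];
        /* Written to avoid overflow in addr + len. */
        if (addr &gt;= r-&gt;base &amp;&amp; len &lt;= r-&gt;size &amp;&amp;
            (addr - r-&gt;base) &lt;= (r-&gt;size - len)) {
            return true;
        }
    }
    return false;
}
</code></pre></div>
<p>On target, <code>dma_mem_check_range()</code> could cast the pointer to <code>uintptr_t</code>, call this predicate, and trap on failure in Debug builds while compiling to nothing in Release builds.</p>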
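<h2 id="十三把经验落到团队规范里">13. Turn the experience into team rules</h2>
<p>Cache coherency should not be one driver engineer's personal habit; it belongs in the team's coding standard. At a minimum, write down the following rules:</p>
<ul>
<li>Every DMA buffer must be declared through a common macro; a casual <code>static uint8_t buf[1024]</code> is not allowed;</li>
<li>Every DMA descriptor must be placed in <code>.dma_desc</code>;</li>
<li>Every DMA data pool must be placed in <code>.dma_buffer</code> or explicitly marked Non-cacheable;</li>
<li>Drivers must perform maintenance through the <code>dma_mem_*</code> interface before and after submitting a buffer;</li>
<li>While the DMA owns a buffer, the CPU must not access it, and debug prints are no exception;</li>
<li>Every new peripheral driver must state its memory region, direction, maintenance operations, and alignment strategy in review;</li>
<li>Every change to the linker script, MPU initialization, or external RAM configuration must be followed by a DMA regression run.</li>
</ul>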
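<p>The first rule can be enforced with a declaration macro that pads the size and aligns the storage in one step. This is a sketch under assumptions: <code>DMA_BUFFER_DECLARE</code> and <code>DMA_ROUND_UP</code> are names invented here, C11 <code>_Alignas</code> is used so the example also builds on a host, and on target you would add the section attribute (for instance via the <code>DMA_SECTION</code> macro shown earlier).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stdint.h&gt;

#define DMA_CACHE_LINE 32U

/* Round a byte count up to a whole number of cache lines. */
#define DMA_ROUND_UP(n) ((((n) + DMA_CACHE_LINE - 1U) / DMA_CACHE_LINE) * DMA_CACHE_LINE)

/* Declare a line-aligned buffer whose size is padded to whole lines, so
   no other variable can share its first or last cache line. */
#define DMA_BUFFER_DECLARE(name, bytes) \
    static _Alignas(DMA_CACHE_LINE) uint8_t name[DMA_ROUND_UP(bytes)]

DMA_BUFFER_DECLARE(uart_rx_buf, 100U);   /* padded to 128 bytes  */
DMA_BUFFER_DECLARE(eth_rx_buf, 1500U);   /* padded to 1504 bytes */

_Static_assert(sizeof(uart_rx_buf) == 128U, "uart_rx_buf padded to 4 lines");
_Static_assert(sizeof(eth_rx_buf) % DMA_CACHE_LINE == 0U, "eth_rx_buf is whole lines");
</code></pre></div>
<p>Because padding and alignment live in the macro, a reviewer only has to check that no DMA buffer is declared any other way.</p>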
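<p>The regression test can be very plain: build buffers with different lengths, offsets, and fill values; have the DMA do a memory-to-memory copy or a peripheral loopback; have the CPU compute a CRC before and after; and cover boundary lengths such as 1, 31, 32, 33, 1500, and 4096 bytes. Many hidden problems surface at lengths like 31 / 33 bytes that are not a whole number of cache lines.</p>
<h2 id="十四总结高性能-mcu-要用-soc-思维来写">14. Summary: program a high-performance MCU with an SoC mindset</h2>
<p>High-performance MCUs such as Cortex-M7 keep the microcontroller development experience, but their memory system is already close to that of a small SoC: caches, an MPU, multiple buses, multiple SRAM domains, external RAM, and several DMA masters all coexist. If you keep writing code with the old model in which “all addresses are equivalent”, the closer the project gets to mass production, the more likely you are to hit problems that are sporadic, hard to reproduce, and hard to locate.</p>
<p>The keys to a stable system are not complicated: first draw the access paths between the CPU, the DMA, and memory; pin down memory attributes with the MPU; isolate descriptors and data pools with the linker script; perform Clean / Invalidate correctly for each direction; bind maintenance to the points where buffer ownership changes; and finally verify performance and correctness with DWT, CRC, and boundary-length tests.</p>
<p>In practice, the most reliable solution is usually neither “turn the cache off globally” nor “sprinkle an Invalidate wherever something breaks”, but managing memory as an engineering discipline. Descriptors are small and critical, so they can be Non-cacheable; bulk data needs throughput, so keep it Cacheable, but with strict alignment, ownership, and wrapped maintenance. This preserves the performance of the Cortex-M7 while letting high-bandwidth peripherals such as ETH, SDMMC, DCMI, LTDC, and DMA2D work reliably.</p>
<p>If you are building an STM32H7 camera gateway, an industrial Ethernet node, a UI controller with a display, an audio capture device, or an edge AI inference board, put the rules from this article into your project template early. The sooner you establish discipline around memory regions and DMA buffers, the less time you will spend later chasing “occasional dirty data”, and the more maintainable the system will be.</p>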
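<h2 id="十五附录一套最小回归用例">15. Appendix: a minimal regression suite</h2>
<p>To finish, here is a simple but very effective regression approach. Set aside a 4 KB DMA test area and test starting at offsets 0, 1, 15, 31, 32, and 33 bytes, covering lengths of 1, 16, 31, 32, 33, 127, 256, and 1500 bytes. Before each test the CPU writes an incrementing pattern, Cleans it, and lets the DMA copy it to another area; once the DMA completes, Invalidate the destination and have the CPU compute a CRC. Then reverse the direction: let the DMA write into the source area and have the CPU verify after an Invalidate. The test needs no complex peripheral; on many chips a memory-to-memory DMA is enough, which makes it a good first check during board bring-up.</p>
<p>If this boundary suite runs continuously for several hours with the D-Cache enabled, at the highest optimization level, and under RTOS task switching and interrupt load all at once, then the memory attributes, linker script, alignment wrappers, and maintenance timing are largely trustworthy. Conversely, if any 31-byte, 33-byte, or non-zero-offset case fails, do not rush to blame the protocol stack; go back and check three things first: cache lines, ownership, and MPU regions.</p>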
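<p>The loop structure of such a suite might look like the sketch below. Everything named <code>hook_*</code> is a hypothetical placeholder: here the hooks are host stand-ins (a <code>memcpy</code> instead of a DMA transfer, empty maintenance calls) so that the test logic itself can be exercised, and on target they would map to <code>dma_mem_clean()</code>, <code>dma_mem_invalidate()</code>, and a blocking memory-to-memory DMA. An Adler-32 checksum stands in for the CRC, and only the first direction (CPU writes, DMA copies, CPU verifies) is shown.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define TEST_AREA_BYTES 4096U

static _Alignas(32) uint8_t src_area[TEST_AREA_BYTES];
static _Alignas(32) uint8_t dst_area[TEST_AREA_BYTES];

/* Hypothetical hooks; replace with real maintenance and DMA on target. */
static void hook_clean(const void *p, size_t n) { (void)p; (void)n; }
static void hook_invalidate(void *p, size_t n)  { (void)p; (void)n; }
static void hook_dma_copy(void *d, const void *s, size_t n) { memcpy(d, s, n); }

/* Adler-32 checksum; a real project might use a hardware CRC unit instead. */
static uint32_t checksum(const uint8_t *p, size_t n)
{
    uint32_t a = 1U, b = 0U;
    for (size_t i = 0; i &lt; n; ++i) {
        a = (a + p[i]) % 65521U;
        b = (b + a) % 65521U;
    }
    return (b &lt;&lt; 16) | a;
}

/* Runs every offset/length combination; returns the number of failures. */
static unsigned run_boundary_tests(void)
{
    static const size_t offsets[] = { 0, 1, 15, 31, 32, 33 };
    static const size_t lengths[] = { 1, 16, 31, 32, 33, 127, 256, 1500 };
    unsigned failures = 0U;

    for (size_t o = 0; o &lt; sizeof offsets / sizeof offsets[0]; ++o) {
        for (size_t l = 0; l &lt; sizeof lengths / sizeof lengths[0]; ++l) {
            uint8_t *src = &amp;src_area[offsets[o]];
            uint8_t *dst = &amp;dst_area[offsets[o]];
            size_t len = lengths[l];

            for (size_t i = 0; i &lt; len; ++i) {
                src[i] = (uint8_t)(i + o + l);   /* incrementing pattern */
            }
            memset(dst, 0xA5, len);              /* stale destination data */

            uint32_t expected = checksum(src, len);
            hook_clean(src, len);       /* CPU wrote src, DMA will read it */
            hook_clean(dst, len);       /* flush dirty 0xA5 lines before DMA writes */
            hook_dma_copy(dst, src, len);
            hook_invalidate(dst, len);  /* DMA wrote dst, CPU will read it */

            if (checksum(dst, len) != expected) {
                ++failures;
            }
        }
    }
    return failures;
}
</code></pre></div>
<p>On the host this always reports zero failures, which only proves the harness is sound. On target, note that the non-zero offsets deliberately produce ranges that are not line-aligned; since the maintenance wrappers round outward, the two test areas should not share cache lines with any other data.</p>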
]]></content:encoded>
    </item>
    <item>
      <title>DDR Memory Bandwidth Tuning in Practice: An SoC Performance Optimization Guide from the AXI Bus to Cache Misses</title>
      <link>https://tech-snippets.xyz/posts/ddr-memory-controller-bandwidth-optimization-guide/</link>
      <pubDate>Mon, 01 Jun 2026 19:00:00 +0800</pubDate>
      <guid>https://tech-snippets.xyz/posts/ddr-memory-controller-bandwidth-optimization-guide/</guid>
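      <description>Preface When working on embedded Linux or edge AI projects, many performance problems eventually come back to a plain but often underestimated fact: compute is not throughput. However fast the CPU, NPU, and GPU run, if the data cannot be fed to them, the memory system caps the performance of the whole machine.
I first truly appreciated how much DDR bandwidth matters while doing 4-camera video analytics on a multi-core ARM SoC. The algorithm engineers saw NPU utilization at only about 40% and assumed the model could grow; the system engineers saw low CPU usage and assumed the bottleneck was not in software. Only when we loaded the ISP, RGA, NPU, and VPU at the same time and read the DDR controller counters did we find that memory reads and writes were already close to the platform's sustainable bandwidth limit. At that moment, “plenty of compute left unused” really meant “everyone is waiting for memory”.
This article tries to explain the problem thoroughly: DDR bandwidth is not an isolated parameter. It runs through the CPU caches, the AXI/NoC interconnect, DMA bursts, memory controller scheduling, DRAM bank conflicts, refresh overhead, and Linux scheduling policy. In many projects people simply run memcpy or stream, see a good number, and conclude that memory is fine; but real workloads are rarely large contiguous copies. Instead, multiple masters access memory at once, reads and writes are mixed, cache hit rates fluctuate, and real-time and background tasks compete for the bus.
Starting from the SoC's point of view, this article breaks down a memory access path and offers a practical method for diagnosis and optimization. The sample code targets Linux user space, with notes on the bare-metal/RTOS approach. The goal is not to memorize every DDR timing parameter but to build a judgment framework that is useful in engineering: when to look at cache misses, when to look at AXI outstanding transactions, when to suspect the DDR controller's page policy, and when to start from data layout and DMA bursts.
1. First, be clear about what “bandwidth” means The theoretical bandwidth calculation commonly found in DDR vendor manuals is simple:
Theoretical bandwidth = data bus width / 8 × data transfer rate For example, 32-bit LPDDR4X at a data rate of 4266 MT/s has a theoretical peak of about:</description>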
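      <content:encoded><![CDATA[<h2 id="前言">Preface</h2>
<p>When working on embedded Linux or edge AI projects, many performance problems eventually come back to a plain but often underestimated fact: compute is not throughput. However fast the CPU, NPU, and GPU run, if the data cannot be fed to them, the memory system caps the performance of the whole machine.</p>
<p>I first truly appreciated how much DDR bandwidth matters while doing 4-camera video analytics on a multi-core ARM SoC. The algorithm engineers saw NPU utilization at only about 40% and assumed the model could grow; the system engineers saw low CPU usage and assumed the bottleneck was not in software. Only when we loaded the ISP, RGA, NPU, and VPU at the same time and read the DDR controller counters did we find that memory reads and writes were already close to the platform's sustainable bandwidth limit. At that moment, “plenty of compute left unused” really meant “everyone is waiting for memory”.</p>
<p>This article tries to explain the problem thoroughly: DDR bandwidth is not an isolated parameter. It runs through the CPU caches, the AXI/NoC interconnect, DMA bursts, memory controller scheduling, DRAM bank conflicts, refresh overhead, and Linux scheduling policy. In many projects people simply run <code>memcpy</code> or <code>stream</code>, see a good number, and conclude that memory is fine; but real workloads are rarely large contiguous copies. Instead, multiple masters access memory at once, reads and writes are mixed, cache hit rates fluctuate, and real-time and background tasks compete for the bus.</p>
<p>Starting from the SoC's point of view, this article breaks down a memory access path and offers a practical method for diagnosis and optimization. The sample code targets Linux user space, with notes on the bare-metal/RTOS approach. The goal is not to memorize every DDR timing parameter but to build a judgment framework that is useful in engineering: when to look at cache misses, when to look at AXI outstanding transactions, when to suspect the DDR controller's page policy, and when to start from data layout and DMA bursts.</p>
<p><img alt="SoC DDR bandwidth path and tuning observation points" loading="lazy" src="/images/ddr-memory-bandwidth-architecture.svg"></p>
<h2 id="一先把带宽这件事说清楚">1. First, be clear about what “bandwidth” means</h2>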
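<p>The theoretical bandwidth calculation commonly found in DDR vendor manuals is simple:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Theoretical bandwidth (MB/s) = data bus width (bits) / 8 × data rate (MT/s)
</span></span></code></pre></div><p>For example, 32-bit LPDDR4X at a data rate of 4266 MT/s has a theoretical peak of about:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">32 / 8 × 4266 = 17064 MB/s ≈ 17 GB/s
</span></span></code></pre></div><p>The number looks great, but the easiest trap in engineering is to treat the theoretical peak as the bandwidth available to the workload. A real system has at least the following kinds of loss:</p>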
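<ol>
<li><strong>Protocol and control overhead</strong>: DRAM is not an infinitely fast SRAM; row activation, precharge, refresh, and read/write turnaround all consume cycles.</li>
<li><strong>Access-pattern loss</strong>: sequential and random access differ enormously, and hitting an open row is far more efficient than switching rows frequently.</li>
<li><strong>Contention among masters</strong>: the CPU, GPU, NPU, ISP, VPU, display controller, PCIe, and USB may all access DDR through the AXI/NoC.</li>
<li><strong>Cache-line granularity amplification</strong>: when the CPU reads a 4-byte integer that is not in the cache, it usually pulls in a whole cache line (with a 64-byte line, that is a 16× amplification).</li>
<li><strong>Extra copies in the software stack</strong>: if video frames, network packets, or AI tensors are copied back and forth between modules, bandwidth is quietly eaten away.</li>
</ol>
<p>So I prefer to split bandwidth into three tiers:</p>
<ul>
<li><strong>Theoretical peak bandwidth</strong>: determined by DDR type, frequency, and bus width; used to judge the upper bound.</li>
<li><strong>Platform sustainable bandwidth</strong>: the figure obtained from long benchmark runs at stable temperature, voltage, and frequency.</li>
<li><strong>Effective workload bandwidth</strong>: the data throughput that actually turns into useful computation in the real workload.</li>
</ul>
<p>The third tier matters most when optimizing. A platform that reaches 12 GB/s on STREAM will not necessarily let your vision pipeline use 12 GB/s. If the algorithm's access pattern is poor, or multimedia DMA and the CPU compete for the bus, the effective workload bandwidth may be only a few GB/s, or even lower.</p>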
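<p>The peak formula and the idea of comparing a measured figure against it can be written down in a few lines. This is only a sketch; the function names are made up for this example, and the 12 GB/s input is the illustrative STREAM figure from the text, not a measurement.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stdint.h&gt;

/* Theoretical peak in MB/s: bus width in bits / 8 × data rate in MT/s. */
static uint32_t ddr_peak_mb_s(uint32_t bus_bits, uint32_t data_rate_mt_s)
{
    return bus_bits / 8U * data_rate_mt_s;
}

/* Measured bandwidth as a percentage of the theoretical peak, rounded down. */
static uint32_t ddr_efficiency_pct(uint32_t measured_mb_s, uint32_t peak_mb_s)
{
    return (uint32_t)(((uint64_t)measured_mb_s * 100U) / peak_mb_s);
}
</code></pre></div>
<p>For 32-bit LPDDR4X at 4266 MT/s the peak is 17064 MB/s. If 12 GB/s were the sustained figure on that same platform, it would correspond to about 70% of peak, and the workload's effective share is usually lower again.</p>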
<h2 id="二一次内存访问到底经过了哪里">2. Where does a single memory access actually go?</h2>
<p>From the CPU's point of view, a line of C code may be no more than:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="n">sum</span> <span class="o">+=</span> <span class="n">buffer</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
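</span></span></code></pre></div><p>Inside the SoC, however, this read may travel the following path:</p>
<ol>
<li>The CPU first looks in the L1 data cache;</li>
<li>On an L1 miss, it looks in L2;</li>
<li>On an L2 miss, it looks in the LLC or system-level cache;</li>
<li>If that also misses, a read transaction is issued over the ACE/CHI/AXI interface;</li>
<li>The request enters the NoC or AXI interconnect and is arbitrated against other masters;</li>
<li>The DDR controller accepts the request and decides which channel, rank, bank, and row to access;</li>
<li>If the target row is already open, it is read directly; otherwise a precharge/activate is needed;</li>
<li>The data comes back through the PHY and returns to the CPU along the interconnect;</li>
<li>The cache line is filled, and only then can the CPU continue with instructions that depend on the data.</li>
</ol>
<p>Any segment of this path can become the bottleneck. On the CPU side you see <code>cache-misses</code>, <code>stalled-cycles</code>, and falling IPC; on the interconnect side, outstanding transactions pile up and QoS latency grows; on the DDR controller side, read/write queues congest, page misses increase, and refresh cycles take their toll; on the application side, frame rate drops, inference latency jitters, and real-time threads occasionally time out.</p>
<p>Watching a single metric makes misdiagnosis easy. A high CPU cache miss rate, for example, does not necessarily mean the DDR frequency is too low; the data structure layout may simply have poor spatial locality. High DDR bandwidth usage does not necessarily call for a higher frequency either; there may just be one needless memory copy too many.</p>
<h2 id="三建立第一组基准连续带宽随机延迟和业务混压">3. Establish the first baselines: sequential bandwidth, random latency, and mixed stress</h2>
<p>You must establish a baseline before tuning. I habitually run at least three kinds of test:</p>
<h3 id="31-连续读写带宽">3.1 Sequential read/write bandwidth</h3>
<p>Sequential bandwidth can be measured with STREAM or with a simplified test of your own. The program below is less rigorous than a professional tool, but it is good for quickly checking the trend across buffer sizes; it is single-threaded, so run several instances in parallel to see the effect of thread count:</p>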
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdint.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;time.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;string.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">double</span> <span class="nf">now_sec</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">struct</span> <span class="n">timespec</span> <span class="n">ts</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">clock_gettime</span><span class="p">(</span><span class="n">CLOCK_MONOTONIC</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ts</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">ts</span><span class="p">.</span><span class="n">tv_sec</span> <span class="o">+</span> <span class="n">ts</span><span class="p">.</span><span class="n">tv_nsec</span> <span class="o">/</span> <span class="mf">1000000000.0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">size_t</span> <span class="n">mb</span> <span class="o">=</span> <span class="n">argc</span> <span class="o">&gt;</span> <span class="mi">1</span> <span class="o">?</span> <span class="nf">strtoull</span><span class="p">(</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="o">:</span> <span class="mi">512</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">size_t</span> <span class="n">bytes</span> <span class="o">=</span> <span class="n">mb</span> <span class="o">*</span> <span class="mi">1024ULL</span> <span class="o">*</span> <span class="mi">1024ULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">src</span><span class="p">,</span> <span class="o">*</span><span class="n">dst</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="nf">posix_memalign</span><span class="p">((</span><span class="kt">void</span> <span class="o">**</span><span class="p">)</span><span class="o">&amp;</span><span class="n">src</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="n">bytes</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="nf">posix_memalign</span><span class="p">((</span><span class="kt">void</span> <span class="o">**</span><span class="p">)</span><span class="o">&amp;</span><span class="n">dst</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="n">bytes</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">memset</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="mh">0x5a</span><span class="p">,</span> <span class="n">bytes</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">memset</span><span class="p">(</span><span class="n">dst</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="n">bytes</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="kt">double</span> <span class="n">t0</span> <span class="o">=</span> <span class="nf">now_sec</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">r</span> <span class="o">&lt;</span> <span class="mi">20</span><span class="p">;</span> <span class="o">++</span><span class="n">r</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">memcpy</span><span class="p">(</span><span class="n">dst</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">bytes</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="kt">double</span> <span class="n">t1</span> <span class="o">=</span> <span class="nf">now_sec</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* Reported in binary gigabytes (GiB). Reading dst afterwards keeps the copies observable to the optimizer. */</span>
</span></span><span class="line"><span class="cl">    <span class="kt">double</span> <span class="n">gb</span> <span class="o">=</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">bytes</span> <span class="o">*</span> <span class="mf">20.0</span> <span class="o">/</span> <span class="mi">1024</span> <span class="o">/</span> <span class="mi">1024</span> <span class="o">/</span> <span class="mi">1024</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">printf</span><span class="p">(</span><span class="s">&#34;copy bandwidth: %.2f GiB/s (check byte 0x%02x)</span><span class="se">\n</span><span class="s">&#34;</span><span class="p">,</span> <span class="n">gb</span> <span class="o">/</span> <span class="p">(</span><span class="n">t1</span> <span class="o">-</span> <span class="n">t0</span><span class="p">),</span> <span class="p">(</span><span class="kt">unsigned</span><span class="p">)</span><span class="n">dst</span><span class="p">[</span><span class="n">bytes</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">free</span><span class="p">(</span><span class="n">src</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">free</span><span class="p">(</span><span class="n">dst</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Compile and run:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">gcc -O3 -march<span class="o">=</span>native memcopy_bw.c -o memcopy_bw
</span></span><span class="line"><span class="cl">./memcopy_bw <span class="m">1024</span>
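</span></span></code></pre></div><p>Note that this result mainly reflects large contiguous copy capability and is also affected by the libc <code>memcpy</code> implementation, CPU prefetching, and cache policy. The figure counts only the bytes copied; each copied byte is read once and written once, so for buffers much larger than the caches the DRAM traffic is at least about twice as high. It works well as a health check of whether the platform is in a normal state, but it does not represent every workload.</p>
<h3 id="32-随机访问延迟">3.2 Random access latency</h3>
<p>Many workloads, such as control code, graph structures, sparse tensors, and database indexes, are latency-bound rather than bandwidth-bound. A platform with high sequential bandwidth will still be slow on them if its random access latency is poor. You can use <code>lat_mem_rd</code> from <code>lmbench</code>, or build your own linked-list pointer-chasing test. The core idea is to make each access depend on the result of the previous read, which leaves the CPU prefetcher little room to help.</p>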
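<p>A home-made pointer-chasing test could be built along the following lines. This is a minimal sketch, not a replacement for <code>lat_mem_rd</code>: the function names are invented here, the chain is stored as an index array rather than real pointers, and Sattolo's algorithm is used so that the whole array forms a single random cycle. The seed must be non-zero for the xorshift generator.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Small xorshift PRNG so the sketch does not depend on rand() quality. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x &lt;&lt; 13;
    x ^= x &gt;&gt; 17;
    x ^= x &lt;&lt; 5;
    return *state = x;
}

/* Fill next[] with one random cycle over all n slots (Sattolo's algorithm),
   so a chase starting anywhere visits every slot before repeating. */
static void build_chain(size_t *next, size_t n, uint32_t seed)
{
    for (size_t i = 0; i &lt; n; ++i) next[i] = i;
    for (size_t i = n; i-- &gt; 1; ) {
        size_t j = xorshift32(&amp;seed) % i;   /* 0 &lt;= j &lt; i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads and return the final index. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t idx = start;
    while (steps--) idx = next[idx];
    return idx;
}
</code></pre></div>
<p>To turn this into a latency measurement, time a long <code>chase()</code> with <code>clock_gettime()</code> and divide by the step count, using an array much larger than the last-level cache. Use the returned index (print it, for instance) so the compiler cannot drop the loop; padding each node to a full cache line is a common refinement.</p>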
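<h3 id="33-混压测试">3.3 Mixed stress test</h3>
<p>What matters most on a real SoC is mixed stress: while the CPU runs a memory test, let camera capture, display refresh, NPU inference, and video encoding all run too. Many problems appear only under mixed stress, because only then are the DDR controller and NoC arbitration policies truly saturated.</p>
<p>I usually record three sets of data:</p>
<ul>
<li>Sequential bandwidth and random latency when idle;</li>
<li>DDR counters and frame rate when the workload runs alone;</li>
<li>Latency jitter when the benchmark and the workload run together.</li>
</ul>
<p>If each test looks fine on its own but jitter appears under mixed stress, check QoS, DMA bursts, memory copies, and task-to-core pinning first, rather than rushing to change DDR timing.</p>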
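<h2 id="四用-perf-先判断是不是-cache-问题">4. Use perf first to decide whether it is a cache problem</h2>
<p>On Linux, the first step is usually not to look at the DDR controller but at the CPU's hardware performance counters, because many cases of “not enough memory bandwidth” are really caused by poor cache usage.</p>
<p>Commonly used commands:</p>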
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">perf stat -e cycles,instructions,cache-references,cache-misses,LLC-loads,LLC-load-misses ./your_app
</span></span></code></pre></div><p>If the platform exposes more events, you can also look at:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">perf stat -e stalled-cycles-frontend,stalled-cycles-backend,branch-misses,dTLB-load-misses ./your_app
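</span></span></code></pre></div><p>A few rules of thumb:</p>
<ul>
<li><strong>Very low IPC with high backend stalls</strong>: the CPU is most likely waiting on memory or on execution-unit resources.</li>
<li><strong>High LLC miss ratio</strong>: the working set exceeds the cache, or access locality is poor.</li>
<li><strong>High dTLB misses</strong>: random access over large arrays is putting pressure on the page tables; consider hugepages or a better layout.</li>
<li><strong>High cache-misses but low DDR bandwidth</strong>: likely a latency bottleneck caused by small random accesses rather than a bandwidth bottleneck.</li>
</ul>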
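<p>The rules of thumb above can be turned into a small helper that reads a dictionary of <code>perf stat</code> counters. This is only a sketch: the event names are the ones used in the commands in this section, while the function name <code>diagnose</code> and every threshold are assumptions to be calibrated per platform, not values taken from any tool.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def diagnose(counters, ipc_low=0.5, stall_high=0.4,
             llc_miss_high=0.3, dtlb_per_kinst_high=1.0):
    """Map raw perf stat counters to hints.

    All thresholds are assumed starting points, not recommended values.
    """
    hints = []
    cycles = counters.get("cycles", 0)
    instructions = counters.get("instructions", 0)
    if cycles and instructions:
        ipc = instructions / cycles
        stall_ratio = counters.get("stalled-cycles-backend", 0) / cycles
        if ipc &lt; ipc_low and stall_ratio &gt; stall_high:
            hints.append("low IPC with high backend stalls")
        # dTLB misses per 1000 instructions
        dtlb = counters.get("dTLB-load-misses", 0) * 1000 / instructions
        if dtlb &gt; dtlb_per_kinst_high:
            hints.append("high dTLB miss rate")
    llc_loads = counters.get("LLC-loads", 0)
    if llc_loads:
        llc_ratio = counters.get("LLC-load-misses", 0) / llc_loads
        if llc_ratio &gt; llc_miss_high:
            hints.append("high LLC miss ratio")
    return hints


if __name__ == "__main__":
    sample = {
        "cycles": 1000, "instructions": 300,
        "stalled-cycles-backend": 600,
        "LLC-loads": 100, "LLC-load-misses": 50,
    }
    for hint in diagnose(sample):
        print(hint)
</code></pre></div>
<p>Missing counters are treated as zero, so the helper stays usable on platforms where some events are not supported.</p>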
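<p>A very typical example is the difference between AoS (Array of Structs) and SoA (Struct of Arrays). Suppose we only need to process the pixel luma <code>y</code>, but the data structure interleaves it with several other fields:</p>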
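<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stddef.h&gt;</span>
</span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdint.h&gt;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>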
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="n">y</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="n">u</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="n">v</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="n">flag</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">timestamp</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="n">PixelMeta</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">uint64_t</span> <span class="nf">sum_y_aos</span><span class="p">(</span><span class="n">PixelMeta</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">s</span> <span class="o">+=</span> <span class="n">p</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">y</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">s</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
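</span></span></code></pre></div><p>Every cache line the CPU fetches is mostly filled with fields the loop does not need: on typical ABIs <code>PixelMeta</code> occupies 8 bytes, so only one byte in eight is a <code>y</code> value. After switching to SoA, the <code>y</code> values sit next to each other and the access becomes sequential:</p>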
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">y</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">u</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">v</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">flag</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">timestamp</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="n">PixelPlane</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">uint64_t</span> <span class="nf">sum_y_soa</span><span class="p">(</span><span class="n">PixelPlane</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">s</span> <span class="o">+=</span> <span class="n">p</span><span class="o">-&gt;</span><span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">s</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
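</span></span></code></pre></div><p>This kind of optimization changes neither the DDR frequency nor the kernel, yet it significantly reduces wasted bandwidth. Especially in image processing, sensor fusion, and inference pre/post-processing, data layout often matters more than simply adding threads.</p>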
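<p>To put a number on the wasted bandwidth, the sketch below estimates how many bytes a sequential scan pulls through the cache for each layout. It mirrors <code>PixelMeta</code> with <code>ctypes</code> to obtain the struct size on the host ABI; the 64-byte cache line, the 1920x1080 element count, and the helper name <code>scan_traffic_bytes</code> are assumptions of this example.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import ctypes


class PixelMeta(ctypes.Structure):
    # Mirrors the C struct, laid out with the host's native alignment.
    _fields_ = [
        ("y", ctypes.c_uint8),
        ("u", ctypes.c_uint8),
        ("v", ctypes.c_uint8),
        ("flag", ctypes.c_uint8),
        ("timestamp", ctypes.c_uint32),
    ]


def scan_traffic_bytes(n_elems, elem_size, line_bytes=64):
    """Bytes fetched by a sequential scan, rounded up to whole cache lines.

    line_bytes=64 is an assumption; check the target core's line size.
    """
    total = n_elems * elem_size
    lines = -(-total // line_bytes)  # ceiling division
    return lines * line_bytes


if __name__ == "__main__":
    n = 1920 * 1080  # assumed element count: one entry per 1080p pixel
    aos = scan_traffic_bytes(n, ctypes.sizeof(PixelMeta))
    soa = scan_traffic_bytes(n, ctypes.sizeof(ctypes.c_uint8))
    print("AoS bytes:", aos)
    print("SoA bytes:", soa)
    print("ratio:", aos // soa)
</code></pre></div>
<p>When the struct is 8 bytes, the AoS scan moves eight times as much data as the SoA scan to sum the same <code>y</code> values.</p>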
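<h2 id="五axinoc-层别让-master-互相踩脚">5. AXI/NoC Layer: Keep Masters from Stepping on Each Other</h2>
<p>DDR in an SoC is not exclusive to the CPU. The camera ISP may continuously write frame buffers, the display controller periodically reads the framebuffer, the NPU reads weights and feature maps, and the VPU encoder reads and writes bitstreams and reference frames. They usually all enter the memory system through AXI or the on-chip NoC.</p>
<p>Common tuning points at the AXI layer include:</p>
<h3 id="51-burst-长度">5.1 Burst Length</h3>
<p>DDR favors sequential access, and AXI bursts that are too short add command overhead. For DMA, keep transfer addresses contiguous, lengths aligned, and bursts long enough. For example, if an image row stride is not aligned to 64/128 bytes, the DMA may be forced to split the transfer into more transactions.</p>
<p>When allocating a DMA buffer in a driver, confirm at least:</p>
<ul>
<li>whether the start address meets the hardware alignment requirement;</li>
<li>whether the per-row stride meets the module's requirement;</li>
<li>whether the buffer crosses a boundary the hardware does not support;</li>
<li>whether cache sync operations are causing extra copies or write-backs.</li>
</ul>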
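<p>The first two checks in the list are simple arithmetic and can be scripted when reviewing buffer layouts. The sketch below is a minimal example under assumed requirements: the 64-byte alignment, the helper names <code>align_up</code> and <code>check_dma_buffer</code>, and the sample addresses are illustrative, and the real constraints come from the DMA engine's documentation.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def check_dma_buffer(start_addr, width_bytes, stride, alignment=64):
    """Return a list of layout problems; an empty list means none found.

    alignment=64 is an assumed requirement, not a value from any datasheet.
    """
    problems = []
    if start_addr % alignment:
        problems.append("start address is not aligned")
    if stride % alignment:
        problems.append("stride is not aligned")
    if width_bytes &gt; stride:
        problems.append("stride is smaller than one row of pixels")
    return problems


if __name__ == "__main__":
    width = 1366  # bytes per row of an 8-bit plane, assumed for illustration
    print(check_dma_buffer(0x80000000, width, width))
    padded = align_up(width, 64)
    print(padded, check_dma_buffer(0x80000000, width, padded))
</code></pre></div>
<p>Padding the stride costs a little memory per row but lets every row start on a burst-friendly boundary.</p>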
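<h3 id="52-outstanding-能力">5.2 Outstanding Capability</h3>
<p>An AXI master can keep multiple incomplete transactions in flight. If the outstanding count is too small, latency cannot be hidden; if it is too large, it may squeeze other real-time masters. Throughput-oriented devices such as NPUs and GPUs usually need a larger outstanding count, while display, audio, and some real-time capture paths care more about the latency upper bound.</p>
<p>If the chip manual exposes outstanding configuration for the NoC or DDR ports, do not blindly max it out; test by workload group instead:</p>
<ul>
<li>NPU inference throughput alone;</li>
<li>NPU + ISP;</li>
<li>NPU + ISP + VPU;</li>
<li>with CPU post-processing threads added.</li>
</ul>
<p>What to watch is the overall frame rate and P99 latency, not the peak of any single module.</p>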
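<p>Why a small outstanding count cannot hide latency follows from Little's law: the data in flight must equal bandwidth times latency. The sketch below computes the minimum number of in-flight bursts for a target bandwidth. The function name <code>min_outstanding</code> and the sample figures (8000 MB/s, 200 ns, 64-byte and 256-byte bursts) are assumptions for illustration, not measurements from any SoC.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def min_outstanding(bandwidth_mb_s, latency_ns, burst_bytes):
    """Smallest number of in-flight bursts that can sustain the bandwidth.

    Little's law: bytes in flight = bandwidth * latency.
    With MB/s (10**6 bytes/s) and ns, that product is
    bandwidth_mb_s * latency_ns / 1000 bytes. Integer math avoids
    floating-point rounding in the ceiling division.
    """
    numerator = bandwidth_mb_s * latency_ns
    denominator = 1000 * burst_bytes
    return -(-numerator // denominator)  # ceiling division


if __name__ == "__main__":
    # Assumed figures for illustration only.
    print(min_outstanding(8000, 200, 64))
    print(min_outstanding(8000, 200, 256))
</code></pre></div>
<p>With these assumed numbers, 64-byte bursts need 25 transactions in flight while 256-byte bursts need only 7, which is one reason longer bursts and outstanding depth are tuned together.</p>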
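<h3 id="53-qos-优先级">5.3 QoS Priority</h3>
<p>Many SoCs have a QoS field or internal arbitration weights on their AXI ports. Real-time streams such as the display controller, audio, and camera input show screen corruption, audio pops, or dropped frames as soon as they are starved, whereas slower AI inference usually just means higher latency. So the goal of QoS is not to make all modules "fair", but to give real-time paths determinism and let throughput-oriented modules consume the remaining bandwidth.</p>
<p>A practical strategy:</p>
<ol>
<li>First make sure the display/capture paths never drop;</li>
<li>Then give the encoder and NPU medium priority;</li>
<li>Lower the priority of CPU background tasks, logging, and file I/O;</li>
<li>Use large bursts for throughput-oriented DMA but cap its outstanding count, so that it does not hold the channel for long periods.</li>
</ol>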
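<p>The "real-time first, throughput takes the rest" idea can be sketched as a toy bandwidth budget. This is not how a hardware arbiter works; it is a planning aid under stated assumptions. The function <code>allocate_bandwidth</code>, the two master classes, and all demand figures are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def allocate_bandwidth(total_mb_s, masters):
    """Toy budget model, not a model of any real arbiter.

    masters is a list of (name, kind, demand_mb_s) tuples, where kind
    is "realtime" or "throughput" (labels assumed for this sketch).
    """
    grants = {}
    remaining = total_mb_s
    # Real-time masters are served first, up to their full demand.
    for name, kind, demand in masters:
        if kind == "realtime":
            grants[name] = min(demand, remaining)
            remaining -= grants[name]
    # Throughput masters share what is left, in proportion to demand.
    bulk = [(n, d) for n, k, d in masters if k == "throughput"]
    bulk_demand = sum(d for _, d in bulk)
    for name, demand in bulk:
        share = remaining * demand // bulk_demand if bulk_demand else 0
        grants[name] = min(demand, share)
    return grants


if __name__ == "__main__":
    # All figures are made-up demands in MB/s.
    plan = allocate_bandwidth(10000, [
        ("display", "realtime", 2000),
        ("camera", "realtime", 1500),
        ("npu", "throughput", 6000),
        ("cpu_bg", "throughput", 4000),
    ])
    for name, grant in plan.items():
        print(name, grant)
</code></pre></div>
<p>A budget like this makes oversubscription visible before any register is touched: if the real-time demands alone exceed the usable bandwidth, no QoS setting will fix the pipeline.</p>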
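<h2 id="六ddr-controllerrow-hit读写切换和刷新">6. DDR Controller: Row Hits, Read/Write Turnaround, and Refresh</h2>
<p>At the DDR controller layer, the problems become more "hardware". The core question here is how the scheduler turns the requests arriving from above into a sequence of DRAM commands.</p>
<p>DRAM is organized internally by bank, row, and column. Accessing a row that is already open is a row hit; accessing a different row in the same bank requires precharging the current row and then activating the new one, which costs noticeably more. So sequential access, row-wise access, and fewer random jumps across banks and rows usually improve efficiency.</p>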
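<p>A toy open-page model makes the difference between sequential and random traffic concrete. The address mapping below (2 KiB rows, 8 banks, consecutive rows interleaved across banks) is an assumption for illustration; real controllers use vendor-specific mappings and scheduling, and <code>row_hit_rate</code> is a name invented for this sketch.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import random


def row_hit_rate(addresses, row_bytes=2048, banks=8):
    """Fraction of accesses that hit an already open row.

    Assumed mapping: consecutive row-sized chunks rotate across banks,
    and each bank keeps its most recently used row open.
    """
    open_rows = {}
    hits = 0
    for addr in addresses:
        chunk = addr // row_bytes
        bank = chunk % banks
        row = chunk // banks
        if open_rows.get(bank) == row:
            hits += 1
        else:
            # Real hardware would precharge and activate here.
            open_rows[bank] = row
    return hits / len(addresses) if len(addresses) else 0.0


if __name__ == "__main__":
    sequential = list(range(0, 1024 * 1024, 64))
    rng = random.Random(0)
    scattered = [rng.randrange(64 * 1024 * 1024) for _ in sequential]
    print("sequential:", row_hit_rate(sequential))
    print("random:    ", row_hit_rate(scattered))
</code></pre></div>
<p>Under this mapping a sequential 64-byte stride misses only on the first access to each row, while random addresses over a large range almost never find their row open.</p>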
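<p>The controller also has to handle read/write switching. Reads and writes drive the bus in opposite directions, and frequent switching incurs a turnaround penalty. Many controllers tend to batch reads or writes before switching to improve bus efficiency; but if they batch for too long, the latency of real-time writes or reads may suffer.</p>
<p>Refresh cannot be ignored either. DRAM needs periodic refresh, and the higher the temperature, the heavier the refresh pressure may be. Some projects saw periodic latency spikes in a thermal chamber and eventually found that the cause was not CPU scheduling but memory refresh coinciding with workload peaks.</p>
<p>If the platform exposes DDR controller counters, pay attention to:</p>
<ul>
<li>number of read/write commands;</li>
<li>row hit / row miss;</li>
<li>bank conflicts;</li>
<li>refresh interval;</li>
<li>port busy or queue full;</li>
<li>bandwidth share of each master.</li>
</ul>
<p>Interfaces vary widely between vendors: some are under <code>/sys</code>, some go through <code>devfreq</code>, and some require reading registers. Do not look only at one total bandwidth number; break it down by master or port if possible, otherwise it is hard to tell who is consuming the bandwidth.</p>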
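<h2 id="七linux-侧常见的隐藏带宽消耗">7. Common Hidden Bandwidth Consumers on the Linux Side</h2>
<p>At the application layer, what wastes the most DDR bandwidth is usually not computation but moving data back and forth. This is especially obvious in video and AI pipelines:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Camera -&gt; ISP -&gt; memory -&gt; CPU memcpy -&gt; NPU input -&gt; memory -&gt; CPU post-processing -&gt; VPU/Display
</span></span></code></pre></div><p>If every arrow lands in DDR and each stage adds another format conversion, the bandwidth is used up quickly. Optimization directions include:</p>
<ul>
<li>use DMA-BUF to share buffers between modules;</li>
<li>let ISP/RGA/NPU/VPU work directly on physically contiguous or IOMMU-mapped buffers where possible;</li>
<li>reduce CPU involvement in large image copies;</li>
<li>standardize on one color format and avoid converting back and forth between NV12/RGB/BGR;</li>
<li>use a suitable caching policy for read-only weights and lookup tables;</li>
<li>avoid unnecessary cache invalidate/clean operations on one-shot DMA buffers.</li>
</ul>
<p>Often, removing two <code>memcpy</code> calls is more effective, and more power-efficient, than raising the DDR data rate from 3200 to 4266 MT/s.</p>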
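<p>A quick estimate shows how much DDR traffic per-frame copies generate. The sketch assumes NV12 frames (12 bits per pixel) and counts each <code>memcpy</code> as one read plus one write of the frame; the resolutions, frame rates, and function names are illustrative choices, not measurements.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def nv12_frame_bytes(width, height):
    """NV12 stores a full-size Y plane plus a half-size UV plane."""
    return width * height * 3 // 2


def memcpy_traffic_bytes_per_s(width, height, fps, copies):
    """DDR traffic caused by CPU copies of every frame.

    Assumes each copy reads the frame once and writes it once,
    and ignores any cache hits (reasonable for multi-megabyte frames).
    """
    return nv12_frame_bytes(width, height) * 2 * copies * fps


if __name__ == "__main__":
    for label, w, h, fps in (("1080p30", 1920, 1080, 30),
                             ("2160p60", 3840, 2160, 60)):
        traffic = memcpy_traffic_bytes_per_s(w, h, fps, copies=2)
        print(label, traffic, "bytes/s")
</code></pre></div>
<p>Under these assumptions two copies per frame cost roughly 0.37 GB/s at 1080p30 and close to 3 GB/s at 2160p60, before any module has done useful work.</p>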
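<h2 id="八一个可复用的带宽排查脚本">8. A Reusable Bandwidth Troubleshooting Script</h2>
<p>The Python script below runs the copy test at different buffer sizes and records the key <code>perf stat</code> metrics. In real projects I keep it in the bring-up toolbox and run it after every change to the DDR frequency, kernel version, or driver DMA policy.</p>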
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="ch">#!/usr/bin/env python3</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">csv</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">re</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">subprocess</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">SIZES_MB</span> <span class="o">=</span> <span class="p">[</span><span class="mi">64</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">512</span><span class="p">,</span> <span class="mi">1024</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="n">EVENTS</span> <span class="o">=</span> <span class="s2">&#34;cycles,instructions,cache-references,cache-misses,LLC-loads,LLC-load-misses&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">BIN</span> <span class="o">=</span> <span class="s2">&#34;./memcopy_bw&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">perf_re</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="sa">r</span><span class="s2">&#34;^\s*([0-9,]+)\s+([A-Za-z0-9_-]+)&#34;</span><span class="p">,</span> <span class="n">re</span><span class="o">.</span><span class="n">M</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">bw_re</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="sa">r</span><span class="s2">&#34;copy bandwidth:\s+([0-9.]+)\s+GB/s&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">run_one</span><span class="p">(</span><span class="n">size_mb</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">cmd</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;perf&#34;</span><span class="p">,</span> <span class="s2">&#34;stat&#34;</span><span class="p">,</span> <span class="s2">&#34;-e&#34;</span><span class="p">,</span> <span class="n">EVENTS</span><span class="p">,</span> <span class="n">BIN</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">size_mb</span><span class="p">)]</span>
</span></span><span class="line"><span class="cl">    <span class="n">p</span> <span class="o">=</span> <span class="n">subprocess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">stdout</span><span class="o">=</span><span class="n">subprocess</span><span class="o">.</span><span class="n">PIPE</span><span class="p">,</span> <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="o">.</span><span class="n">PIPE</span><span class="p">,</span> <span class="n">check</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">out</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">stdout</span> <span class="o">+</span> <span class="s2">&#34;</span><span class="se">\n</span><span class="s2">&#34;</span> <span class="o">+</span> <span class="n">p</span><span class="o">.</span><span class="n">stderr</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">row</span> <span class="o">=</span> <span class="p">{</span><span class="s2">&#34;size_mb&#34;</span><span class="p">:</span> <span class="n">size_mb</span><span class="p">,</span> <span class="s2">&#34;bandwidth_gbps&#34;</span><span class="p">:</span> <span class="kc">None</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="n">m</span> <span class="o">=</span> <span class="n">bw_re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">m</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">row</span><span class="p">[</span><span class="s2">&#34;bandwidth_gbps&#34;</span><span class="p">]</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">value</span><span class="p">,</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">perf_re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">out</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">row</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">value</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">&#34;,&#34;</span><span class="p">,</span> <span class="s2">&#34;&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">row</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">rows</span> <span class="o">=</span> <span class="p">[</span><span class="n">run_one</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">SIZES_MB</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="n">keys</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">({</span><span class="n">k</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">rows</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">r</span><span class="o">.</span><span class="n">keys</span><span class="p">()})</span>
</span></span><span class="line"><span class="cl"><span class="n">Path</span><span class="p">(</span><span class="s2">&#34;results&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&#34;results/memory_perf.csv&#34;</span><span class="p">,</span> <span class="s2">&#34;w&#34;</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s2">&#34;&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">writer</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictWriter</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">fieldnames</span><span class="o">=</span><span class="n">keys</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">writer</span><span class="o">.</span><span class="n">writeheader</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">writer</span><span class="o">.</span><span class="n">writerows</span><span class="p">(</span><span class="n">rows</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">rows</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">miss</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;cache-misses&#34;</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">ref</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;cache-references&#34;</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">miss_rate</span> <span class="o">=</span> <span class="n">miss</span> <span class="o">/</span> <span class="n">ref</span> <span class="o">*</span> <span class="mi">100</span> <span class="k">if</span> <span class="n">ref</span> <span class="k">else</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="n">bw</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;bandwidth_gbps&#34;</span><span class="p">)</span> <span class="ow">or</span> <span class="mf">0.0</span>  <span class="c1"># None if the benchmark output did not match</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">r</span><span class="p">[</span><span class="s1">&#39;size_mb&#39;</span><span class="p">]</span><span class="si">:</span><span class="s2">4d</span><span class="si">}</span><span class="s2"> MB  </span><span class="si">{</span><span class="n">bw</span><span class="si">:</span><span class="s2">6.2f</span><span class="si">}</span><span class="s2"> GB/s  cache miss </span><span class="si">{</span><span class="n">miss_rate</span><span class="si">:</span><span class="s2">5.2f</span><span class="si">}</span><span class="s2">%&#34;</span><span class="p">)</span>
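</span></span></code></pre></div><p>This script is no substitute for professional testing, but it has two benefits: first, every optimization gets a data record; second, it quickly exposes cases where a seemingly unrelated change altered memory behavior. For example, if a driver change switches a buffer from cacheable to non-cacheable, CPU post-processing performance drops immediately; if a device-tree change sets the wrong DDR devfreq operating point, sequential bandwidth changes noticeably too.</p>
<h2 id="九实战调优顺序不要一上来就改-ddr-参数">9. Practical Tuning Order: Do Not Start by Changing DDR Parameters</h2>
<p>DDR parameters are tempting because they look closest to the bottleneck. But in mass-production projects, casually modifying DDR training, ODT, timings, or frequency carries far more risk than benefit. The order I recommend is:</p>
<h3 id="91-先确认频率和工作模式">9.1 First Confirm Frequency and Operating Mode</h3>
<p>Check whether DDR runs at the expected frequency, whether devfreq is being held down by a power-saving policy, whether both channels are enabled, and whether the bus width matches the hardware design. Many "performance problems" turn out to be nothing more than a wrong device-tree operating point, bootloader initialization, or power-mode configuration.</p>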
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">cat /sys/class/devfreq/*/cur_freq 2&gt;/dev/null
</span></span><span class="line"><span class="cl">cat /sys/class/devfreq/*/available_frequencies 2&gt;/dev/null
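</span></span></code></pre></div><p>Node names differ between platforms, so the commands above are only illustrative. The key is to record the frequency when idle, when the workload runs, and under mixed load.</p>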
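<p>To log these values alongside benchmark results, the same sysfs files can be read from Python. The sketch below assumes only the <code>/sys/class/devfreq/*/cur_freq</code> and <code>available_frequencies</code> files used in the commands above; the function name <code>read_devfreq</code> is hypothetical, and which node belongs to the DDR controller is platform-specific.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from pathlib import Path


def read_devfreq(root="/sys/class/devfreq"):
    """Return {node_name: {file_name: [values]}} for every devfreq node.

    Missing or unreadable files are skipped, so the result may be empty
    on platforms without devfreq support.
    """
    result = {}
    base = Path(root)
    if not base.is_dir():
        return result
    for node in sorted(base.iterdir()):
        entry = {}
        for name in ("cur_freq", "available_frequencies"):
            try:
                entry[name] = (node / name).read_text().split()
            except OSError:
                continue
        if entry:
            result[node.name] = entry
    return result


if __name__ == "__main__":
    for node, info in read_devfreq().items():
        print(node, info)
</code></pre></div>
<p>Calling it once when idle, once with the workload, and once under mixed load gives the three snapshots described above in a form that can be stored next to the bandwidth numbers.</p>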
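<h3 id="92-再减少无效流量">9.2 Then Reduce Wasted Traffic</h3>
<p>Start by removing duplicate copies, back-and-forth format conversions, high-volume log writes to disk, debug overlays, and pointless buffer zeroing. Especially in imaging projects, <code>memset</code> and <code>memcpy</code> often hide inside innocuous-looking wrapper functions.</p>
<p>You can temporarily wrap <code>memcpy</code> with <code>LD_PRELOAD</code> to collect statistics, or add tracing to large copies in the code. Do not judge by feel: many teams are surprised to find that the CPU moves several times the size of the original image every frame.</p>
<h3 id="93-然后优化访问局部性">9.3 Next, Optimize Access Locality</h3>
<p>This includes changing data structures from AoS to SoA, reordering loops, tile/block processing, prefetching, alignment, and reducing random access. In matrix and image algorithms, blocking is often very effective: it keeps the working set in L1/L2 instead of going back to DDR at every step.</p>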
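<p>For blocking, the first practical question is how large a tile can be while its working set still fits in cache. The sketch below picks a square tile edge and enumerates the tiles of an image. The 32 KiB cache size, the two-buffer assumption (one input tile plus one output tile), the rounding to a multiple of 16, and both function names are assumptions of this example.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import math


def pick_tile_edge(cache_bytes, bytes_per_elem, buffers=2, multiple=16):
    """Largest square tile edge whose working set fits in cache_bytes.

    buffers=2 assumes one input and one output tile are live at once;
    the edge is rounded down to a multiple (assumed 16) for alignment.
    """
    budget_elems = cache_bytes // (buffers * bytes_per_elem)
    edge = math.isqrt(budget_elems)
    return max(multiple, edge // multiple * multiple)


def tiles(width, height, edge):
    """Yield (x, y, w, h) tiles in row-major order, clipping at the borders."""
    for ty in range(0, height, edge):
        for tx in range(0, width, edge):
            yield tx, ty, min(edge, width - tx), min(edge, height - ty)


if __name__ == "__main__":
    edge = pick_tile_edge(32 * 1024, bytes_per_elem=1)  # assumed 32 KiB L1
    print("tile edge:", edge)
    print("tiles in 1920x1080:", sum(1 for _ in tiles(1920, 1080, edge)))
</code></pre></div>
<p>The computed edge is only a starting point; the best value depends on associativity, other data sharing the cache, and the algorithm's reuse pattern, so it should be confirmed by measurement.</p>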
<p>A simple 2D array traversal example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Bad: column-wise access, large stride between rows, poor cache utilization
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">width</span><span class="p">;</span> <span class="o">++</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">height</span><span class="p">;</span> <span class="o">++</span><span class="n">y</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">acc</span> <span class="o">+=</span> <span class="n">img</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="n">x</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Good: row-wise access, sequential reads
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">height</span><span class="p">;</span> <span class="o">++</span><span class="n">y</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">row</span> <span class="o">=</span> <span class="n">img</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">width</span><span class="p">;</span> <span class="o">++</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">acc</span> <span class="o">+=</span> <span class="n">row</span><span class="p">[</span><span class="n">x</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
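</span></span></code></pre></div><h3 id="94-最后再碰-qos-和-ddr-控制器">9.4 Only Then Touch QoS and the DDR Controller</h3>
<p>Once you have confirmed that wasted traffic is down and access patterns are reasonable, but jitter persists under mixed load, then tune QoS, outstanding counts, read/write queues, watermarks, and the DDR devfreq governor. Change one variable at a time and record the mean, P95, and P99 rather than only the peak.</p>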
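<p>Recording the mean together with P95 and P99 takes only a few lines. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions; the function names and the sample latencies are invented for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100)
    return ordered[rank - 1]


def summarize(samples):
    """Mean, tail percentiles, and maximum of a non-empty sample list."""
    return {
        "mean": sum(samples) / len(samples),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Made-up frame latencies in ms: mostly 10 ms with a few 50 ms spikes.
    latencies = [10] * 97 + [50] * 3
    print(summarize(latencies))
</code></pre></div>
<p>In this made-up data set the mean and P95 barely register the spikes while P99 and the maximum expose them, which is why the tail percentiles belong in every tuning record.</p>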
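<h2 id="十几个常见故障现象和定位方向">10. Common Failure Symptoms and Where to Look</h2>
<h3 id="101-单测很快业务很慢">10.1 Fast in Isolated Tests, Slow in the Real Workload</h3>
<p>Suspect multi-master contention, extra copies, and cache synchronization first. A sequential memory test cannot reproduce the workload's read/write mix and real-time constraints, so a mixed-load test is required.</p>
<h3 id="102-平均帧率够偶发掉帧">10.2 Average Frame Rate Is Fine, but Frames Drop Occasionally</h3>
<p>Look at P99 latency, scheduler preemption, DDR refresh, QoS, and thermal throttling. For display, camera, and audio problems in particular, focus on the worst case rather than the average.</p>
<h3 id="103-cpu-占用不高但程序慢">10.3 Low CPU Usage but the Program Is Slow</h3>
<p>It may be backend stalls, with threads waiting on memory. Use <code>perf stat</code> to check IPC, cache misses, and dTLB misses, then use a flame graph to see whether the hotspot is random access over a large array.</p>
<h3 id="104-npu-利用率上不去">10.4 NPU Utilization Will Not Go Up</h3>
<p>A small model is not the only explanation. The cause may also be input preprocessing, tensor layout conversion, weight fetches, or waits caused by the NPU sharing DDR with the CPU/GPU. Check whether zero-copy input is supported and whether an unnecessary NHWC/NCHW conversion happens on every frame.</p>
<h3 id="105-高温后性能下降">10.5 Performance Drops at High Temperature</h3>
<p>Check DDR devfreq, CPU/GPU/NPU throttling, DRAM refresh, and PMIC current limiting. For high-temperature problems, do not focus only on the CPU temperature; memory and power policies also affect throughput.</p>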
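<h2 id="十一量产项目里的建议清单">11. A Checklist for Mass-Production Projects</h2>
<p>Finally, here is a checklist I often use in project reviews:</p>
<ul>
<li>Do the DDR frequency, bus width, and channel count match the hardware design?</li>
<li>Are the DDR/devfreq configurations in the bootloader and the kernel consistent?</li>
<li>Are there three kinds of baseline data: STREAM, random latency, and workload mixed load?</li>
<li>Have CPU PMU, DDR controller, and NoC/AXI port counters been recorded?</li>
<li>Does the video/AI pipeline use DMA-BUF or an equivalent zero-copy mechanism?</li>
<li>Are large buffers aligned, and does the stride suit the DMA burst size?</li>
<li>Does the CPU perform large <code>memcpy</code>, <code>memset</code>, or format-conversion operations?</li>
<li>Do hot data structures have good spatial locality?</li>
<li>Is the QoS of real-time masters higher than that of throughput-oriented background tasks?</li>
<li>Is evaluation based on P95/P99 latency rather than only average throughput?</li>
<li>Has everything been re-verified at high temperature, low voltage, and in power-saving modes?</li>
</ul>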
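<h2 id="总结">Summary</h2>
<p>DDR bandwidth tuning is not a single-point optimization but system engineering along an entire path. The CPU sees cache misses, DMA sees bursts and alignment, the NoC sees arbitration and outstanding transactions, the DDR controller sees row hits, read/write turnaround, and refresh, and the workload ultimately sees frame rate, latency, and stability.</p>
<p>A tuning order that actually works is: establish a baseline first, then confirm frequency and hardware configuration; reduce wasted copies first, then improve data locality; locate the bottleneck with perf and workload traces first, then adjust QoS, the NoC, and the DDR controller. Unless you already have solid data showing that the bottleneck is in the memory controller, do not start by changing DDR timings.</p>
<p>If you can remember only one sentence, make it this: <strong>the goal of memory-system optimization is not the highest GB/s figure, but delivering useful data stably and predictably to the compute units that need it under real mixed workloads.</strong> This is also the most interesting boundary between chip architecture and embedded software: the hardware gives you the ceiling, and the software decides how close you get to it.</p>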
]]></content:encoded>
    </item>
  </channel>
</rss>
