<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>嵌入式架构 on Tech Snippets - 嵌入式技术笔记</title>
    <link>https://tech-snippets.xyz/tags/%E5%B5%8C%E5%85%A5%E5%BC%8F%E6%9E%B6%E6%9E%84/</link>
    <description>Recent content in 嵌入式架构 on Tech Snippets - 嵌入式技术笔记</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Mon, 08 Jun 2026 19:00:00 +0800</lastBuildDate>
    <atom:link href="https://tech-snippets.xyz/tags/%E5%B5%8C%E5%85%A5%E5%BC%8F%E6%9E%B6%E6%9E%84/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Cortex-M Cache, MPU, and DMA Coherency in Practice: Making High-Performance MCUs like the STM32H7 Stable and Fast</title>
      <link>https://tech-snippets.xyz/posts/arm-cortex-m-cache-mpu-dma-coherency-guide/</link>
      <pubDate>Mon, 08 Jun 2026 19:00:00 +0800</pubDate>
      <guid>https://tech-snippets.xyz/posts/arm-cortex-m-cache-mpu-dma-coherency-guide/</guid>
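      <description>Preface: the most insidious trap on a high-performance MCU is not a lack of compute but "inconsistent" data. Many people migrating from Cortex-M3 / Cortex-M4 to Cortex-M7 for the first time notice the difference immediately: a higher clock, a stronger FPU, more on-chip SRAM, and more peripheral bandwidth. On the STM32H7, NXP i.MX RT, and some domestic high-performance MCUs, the system now contains I-Cache, D-Cache, AXI SRAM, a multi-layer bus matrix, MDMA, ETH, SDMMC, DCMI, and LTDC, modules that rarely needed serious attention on small MCUs. The code is still C, the peripherals still use DMA, and the debugger still single-steps, but once a project involves image capture, Ethernet, a file system, audio streaming, or display refresh, the problems become very strange:
DMA has clearly finished writing the buffer, yet the CPU still reads old data; the CPU has clearly filled in the transmit packet, yet the Ethernet DMA sends out the previous frame; with the D-Cache turned off the system is stable, but throughput drops sharply; after adding one SCB_CleanDCache_by_Addr() call it sometimes works and sometimes fails; the same code works in a Debug build but fails in a Release build or at a different optimization level; when the buffer length is not a multiple of 32 bytes, neighboring variables get corrupted "for no apparent reason". The root cause is usually neither a broken peripheral driver nor compiler "black magic": the CPU, Cache, MPU, and DMA disagree about the same piece of memory. The Cortex-M7 D-Cache speeds up CPU accesses, but a DMA controller usually does not go through the D-Cache; it reads and writes SRAM or external RAM directly. So at one and the same address the CPU may see new data in a Cache Line while DMA sees old data in memory; conversely, DMA may already have written new data to memory while the CPU still hits the old Cache Line.</description>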
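      <content:encoded><![CDATA[<h2 id="前言高性能-mcu-最隐蔽的坑不是算力不够而是数据不一致">Preface: The Most Insidious Trap on a High-Performance MCU Is Not a Lack of Compute but "Inconsistent" Data</h2>
<p>Many people migrating from Cortex-M3 / Cortex-M4 to Cortex-M7 for the first time notice the difference immediately: a higher clock, a stronger FPU, more on-chip SRAM, and more peripheral bandwidth. On the STM32H7, NXP i.MX RT, and some domestic high-performance MCUs, the system now contains I-Cache, D-Cache, AXI SRAM, a multi-layer bus matrix, MDMA, ETH, SDMMC, DCMI, and LTDC, modules that rarely needed serious attention on small MCUs. The code is still C, the peripherals still use DMA, and the debugger still single-steps, but once a project involves image capture, Ethernet, a file system, audio streaming, or display refresh, the problems become very strange:</p>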
<ul>
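<li>DMA has clearly finished writing the buffer, yet the CPU still reads old data;</li>
<li>The CPU has clearly filled in the transmit packet, yet the Ethernet DMA sends out the previous frame;</li>
<li>With the D-Cache turned off the system is stable, but throughput drops sharply;</li>
<li>After adding one <code>SCB_CleanDCache_by_Addr()</code> call, it sometimes works and sometimes fails;</li>
<li>The same code works in a Debug build but fails in a Release build or at a different optimization level;</li>
<li>When the buffer length is not a multiple of 32 bytes, neighboring variables get corrupted "for no apparent reason".</li>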
</ul>
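<p>The root cause of these symptoms is usually neither a broken peripheral driver nor compiler "black magic": the CPU, Cache, MPU, and DMA disagree about the same piece of memory. The Cortex-M7 D-Cache speeds up CPU accesses, but a DMA controller usually does not go through the D-Cache; it reads and writes SRAM or external RAM directly. So at one and the same address the CPU may see new data in a Cache Line while DMA sees old data in memory; conversely, DMA may already have written new data to memory while the CPU still hits the old Cache Line.</p>
<p>This article does not translate the register manual bit by bit. It takes an engineering view and answers three questions: first, why Cache, MPU, and DMA affect one another; second, how to design maintainable memory regions and buffer strategies; third, how to write code that stays stable over the long run in networking, camera, SD card, and display-refresh scenarios. The examples lean toward STM32H7 / Cortex-M7, but the approach applies equally to other high-performance MCUs that have a D-Cache, an MPU, and DMA.</p>
<p><img alt="Relationship between Cortex-M Cache, MPU, and DMA coherency" loading="lazy" src="/images/arm-cortex-m-cache-mpu-dma-coherency.svg"></p>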
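<h2 id="一先建立一个基本模型cpu-看-cachedma-看内存">1. Start with a Basic Model: the CPU Sees the Cache, DMA Sees Memory</h2>
<p>In traditional Cortex-M0 / M3 / M4 projects we tend to bind "address" and "data" together: a pointer refers to SRAM, whatever the CPU writes is what DMA reads, and whatever DMA writes is what the CPU reads next. In a system without a D-Cache this model essentially holds; at most you have to think about <code>volatile</code>, interrupt races, and bus bandwidth.</p>
<p>On Cortex-M7 the model has to change: when the CPU accesses an address it may go to the D-Cache first; when DMA accesses an address it usually goes straight to the physical memory, that is SRAM, AXI SRAM, external SDRAM, or DTCM where the DMA master can reach it. The Cortex-M7 D-Cache uses 32-byte Cache Lines. When the CPU writes a variable, it may merely mark the corresponding Cache Line dirty without writing it back to memory yet. When the CPU reads an address and the Cache hits, it may never go to memory to fetch what DMA has just written.</p>
<p>This is the so-called Cache coherency problem. Desktop CPUs and high-end SoCs usually have hardware coherency: multiple cores, DMA engines, and peripherals stay consistent through protocols such as ACE and CHI. On many MCUs, however, DMA does not take part in any D-Cache coherency protocol, so the maintenance responsibility falls on software.</p>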
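<p>The model above can be tried out on a host PC. The following is a minimal sketch, not real hardware behavior: a toy with one 32-byte block of RAM and a single write-back Cache Line shadowing it. All names (<code>toy_mem_t</code>, <code>cpu_write</code>, <code>dma_read</code>, and so on) are made up for illustration.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32U

/* Toy model (assumption): one block of RAM plus one cache line shadowing it. */
typedef struct {
    uint8_t ram[LINE_SIZE];   /* what DMA sees */
    uint8_t line[LINE_SIZE];  /* what the CPU sees on a cache hit */
    bool valid;
    bool dirty;
} toy_mem_t;

void toy_init(toy_mem_t *m)
{
    memset(m, 0, sizeof(*m));
}

/* CPU read: fill the line from RAM on a miss, then serve from the cache. */
uint8_t cpu_read(toy_mem_t *m, uint32_t off)
{
    assert(off < LINE_SIZE);
    if (!m->valid) {
        memcpy(m->line, m->ram, LINE_SIZE);
        m->valid = true;
        m->dirty = false;
    }
    return m->line[off];
}

/* CPU write (write-back, write-allocate): only the cache line changes. */
void cpu_write(toy_mem_t *m, uint32_t off, uint8_t v)
{
    (void)cpu_read(m, off);   /* allocate the line on a miss */
    m->line[off] = v;
    m->dirty = true;
}

/* DMA bypasses the cache entirely. */
uint8_t dma_read(const toy_mem_t *m, uint32_t off)
{
    assert(off < LINE_SIZE);
    return m->ram[off];
}

void dma_write(toy_mem_t *m, uint32_t off, uint8_t v)
{
    assert(off < LINE_SIZE);
    m->ram[off] = v;
}

/* Clean: write a dirty line back to RAM, keep it valid. */
void cache_clean(toy_mem_t *m)
{
    if (m->valid && m->dirty) {
        memcpy(m->ram, m->line, LINE_SIZE);
        m->dirty = false;
    }
}

/* Invalidate: drop the line without writing it back. */
void cache_invalidate(toy_mem_t *m)
{
    m->valid = false;
    m->dirty = false;
}
```

<p>In this toy, a <code>cpu_write()</code> followed by <code>dma_read()</code> returns the old RAM content until <code>cache_clean()</code> runs, and a <code>dma_write()</code> followed by <code>cpu_read()</code> returns the old cached value until <code>cache_invalidate()</code> runs, which mirrors the two failure cases listed in the preface.</p>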
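<h3 id="三类常见内存的差异">Differences Among Three Common Kinds of Memory</h3>
<p>On the STM32H7, for example, developers regularly meet DTCM RAM, AXI SRAM, SRAM1 / SRAM2 / SRAM3, external SDRAM, and other regions. They differ not only in capacity but also in access path:</p>
<ol>
<li><strong>DTCM RAM</strong>: very fast for the CPU, a good home for the stack, real-time control variables, and intermediate algorithm state. Many DMA peripherals cannot reach DTCM, though, so a DMA buffer placed there simply fails; the typical symptom is a DMA that moves nothing or a peripheral that sees no data.</li>
<li><strong>AXI SRAM</strong>: sits on the AXI bus matrix and is reachable by the CPU and by many DMA masters, which makes it the usual choice for large DMA buffers. With the Cache enabled the CPU processes it quickly, but coherency maintenance is mandatory.</li>
<li><strong>External SDRAM / PSRAM</strong>: large, suitable for frame buffers, intermediate neural-network tensors, and file caches, but latency and arbitration are more complex. It needs the Cache more, and it exposes refresh, alignment, and bandwidth bottlenecks more readily.</li>
</ol>
<p>So the first engineering principle is: do not only ask "is this memory big enough"; also ask "who accesses it, over which bus, does the access go through the Cache, and is DMA allowed to reach it".</p>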
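<p>One way to make that question checkable is a small table-driven helper that a driver can call before it programs a DMA address. The sketch below is an assumption-laden illustration: the two regions and their sizes are taken from the example linker script used later in this article, the "DMA cannot reach DTCM" flag reflects the general-purpose DMA case described above, and the names <code>mem_region_t</code> and <code>dma_buffer_ok</code> are hypothetical.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout, matching the example linker script in this article. */
#define DTCM_BASE   0x20000000u
#define DTCM_SIZE   (128u * 1024u)
#define AXIRAM_BASE 0x24000000u
#define AXIRAM_SIZE (512u * 1024u)

typedef struct {
    uint32_t base;
    uint32_t size;
    bool dma_reachable;  /* can the general-purpose DMA master reach it? */
    bool cacheable;      /* do CPU accesses go through the D-Cache? */
} mem_region_t;

static const mem_region_t k_regions[] = {
    { DTCM_BASE,   DTCM_SIZE,   false, false },
    { AXIRAM_BASE, AXIRAM_SIZE, true,  true  },
};

/* Return the region that fully contains [addr, addr + len), or NULL. */
const mem_region_t *mem_region_find(uint32_t addr, size_t len)
{
    for (size_t i = 0; i < sizeof(k_regions) / sizeof(k_regions[0]); i++) {
        const mem_region_t *r = &k_regions[i];
        if (len != 0u && addr >= r->base && len <= r->size &&
            (addr - r->base) <= (r->size - len)) {
            return r;
        }
    }
    return NULL;
}

/* A buffer is acceptable for DMA only if it lies wholly in a reachable region. */
bool dma_buffer_ok(uint32_t addr, size_t len)
{
    const mem_region_t *r = mem_region_find(addr, len);
    return r != NULL && r->dma_reachable;
}
```

<p>With this table, <code>dma_buffer_ok(0x20000100u, 64)</code> is rejected because the address is in DTCM, while a 1536-byte buffer at the start of AXI SRAM is accepted. On a real chip the table has to be filled in from the reference manual.</p>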
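<h2 id="二mpu-的价值把内存使用约定固化为硬件属性">2. The Value of the MPU: Turning "Memory Usage Conventions" into Hardware Attributes</h2>
<p>The MPU (Memory Protection Unit) is ignored in many bare-metal projects, on the assumption that only an RTOS needs it for task isolation and permission control. In fact, on an MCU with a D-Cache the most common use of the MPU is to define memory attributes: whether a region is cacheable, Bufferable, or Shareable, whether execution is allowed, and what the read/write permissions are.</p>
<p>Without MPU configuration, many libraries enable the Cache on top of the default memory attributes. Those defaults may be fine for ordinary code and data, but not necessarily for buffers shared with DMA. A mature high-performance MCU project usually divides memory into several classes:</p>
<ul>
<li>Ordinary code and regular data: cacheable, optimized for CPU performance;</li>
<li>DMA descriptors: non-cacheable or strictly hand-maintained, optimized for determinism;</li>
<li>Large DMA data buffers: cacheable with Clean / Invalidate according to direction, balancing throughput and correctness;</li>
<li>Peripheral register space: Device type, with no reordering and no caching;</li>
<li>Frame buffers: Write-through, Write-back, or Non-cacheable, chosen according to the mix of LTDC / DMA2D / CPU drawing.</li>
</ul>
<p>The split looks tedious, but it turns the "convention" into hardware configuration applied at system startup rather than comments scattered across driver code.</p>
<h3 id="一个典型的-mpu-区域规划">A Typical MPU Region Plan</h3>
<p>The plan below is simplified and does not correspond to the full address map of any particular chip, but it is enough to show the idea:</p>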
<table>
<thead>
<tr>
<th>Region</th>
<th>Use</th>
<th>Suggested attributes</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
<td>Flash</td>
<td>Code, read-only tables</td>
<td>Normal, Cacheable, Executable</td>
<td>Faster instruction fetch and table lookup</td>
</tr>
<tr>
<td>DTCM</td>
<td>Stack, control variables, real-time algorithms</td>
<td>Normal, Non-cacheable, or TCM default</td>
<td>Low-latency CPU access, not used for DMA</td>
</tr>
<tr>
<td>AXI SRAM, general section</td>
<td>Heap, algorithm buffers</td>
<td>Normal, Write-back Cacheable</td>
<td>Higher CPU processing throughput</td>
</tr>
<tr>
<td>AXI SRAM, DMA descriptor section</td>
<td>ETH, SDMMC, USB descriptors</td>
<td>Normal, Non-cacheable</td>
<td>Keeps descriptor state in sync</td>
</tr>
<tr>
<td>AXI SRAM, DMA data section</td>
<td>Network packets, image blocks, audio blocks</td>
<td>Normal, Cacheable + manual maintenance</td>
<td>Cache speeds up bulk data</td>
</tr>
<tr>
<td>Peripheral registers</td>
<td>GPIO, DMA, ETH, etc.</td>
<td>Device, Non-cacheable</td>
<td>No caching and no unsafe reordering</td>
</tr>
</tbody>
</table>
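<p>Note that MPU Regions come with alignment rules and are a scarce resource. On the ARMv7-M MPU used by Cortex-M7, a Region's size must be a power of two of at least 32 bytes, its base address must be a multiple of its size, and only 8 or 16 Regions exist depending on the implementation. Do not create a Region for every small array. Put shared buffers into common link sections such as <code>.dma_desc</code> and <code>.dma_buffer</code>, and let the linker script guarantee alignment and boundaries.</p>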
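<p>The size and alignment rule is easy to get wrong by hand, so it can help to check candidate Regions in a unit test or with an assertion at startup. The following is a minimal sketch under the ARMv7-M assumption (power-of-two size of at least 32 bytes, base aligned to the size, and a SIZE field that encodes the size as 2^(SIZE+1)); the function names are hypothetical and not part of any vendor HAL.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-M MPU rule: the size is a power of two of at least 32 bytes,
 * and the base address is a multiple of the size. */
bool mpu_region_valid(uint32_t base, uint32_t size)
{
    bool pow2 = (size != 0u) && ((size & (size - 1u)) == 0u);
    return pow2 && (size >= 32u) && ((base & (size - 1u)) == 0u);
}

/* Encode a valid size for the region SIZE field:
 * region size = 2^(SIZE + 1), so 32 bytes encodes as 4. */
uint32_t mpu_size_field(uint32_t size)
{
    assert(mpu_region_valid(0u, size));
    uint32_t log2 = 0u;
    while ((size >> log2) > 1u) {
        log2++;
    }
    return log2 - 1u;
}
```

<p>For instance, a 16 KB window at <code>0x24070000</code> passes the check, the same window moved to <code>0x24071000</code> does not, and a 48 KB size is rejected outright because it is not a power of two.</p>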
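<h2 id="三cache-维护动作cleaninvalidate-和-cleaninvalidate-的边界">3. Cache Maintenance Operations: the Boundaries of Clean, Invalidate, and CleanInvalidate</h2>
<p>The Cache maintenance APIs have very similar names, and many bugs come from calling them at the wrong moment. A short rule of thumb:</p>
<ul>
<li><strong>CPU writes, DMA reads</strong>: Clean before starting DMA, so that dirty data the CPU left in the Cache is written back to memory;</li>
<li><strong>DMA writes, CPU reads</strong>: Invalidate after DMA completes, so that the CPU drops the old Cache Line and fetches new data from memory next time;</li>
<li><strong>Bidirectional or uncertain state</strong>: use CleanInvalidate with care, and first confirm that a stale dirty Cache Line cannot be written back over data DMA has just written.</li>
</ul>
<p>Take sending a network packet: the CPU builds the Ethernet frame, then the ETH DMA reads it from memory and transmits it. If the frame content is still sitting in the D-Cache, the ETH DMA reads old memory. The correct action is to Clean the data buffer before handing the descriptor to DMA.</p>
<p>Take receiving a network packet: the ETH DMA writes the data to memory, then an interrupt notifies the CPU. If the CPU has read this buffer before, the corresponding Cache Lines may still be in the D-Cache. The correct action is to Invalidate the buffer before the CPU parses the packet.</p>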
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#define CACHE_LINE_SIZE 32U
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kr">inline</span> <span class="kt">uintptr_t</span> <span class="nf">align_down</span><span class="p">(</span><span class="kt">uintptr_t</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">uintptr_t</span> <span class="n">align</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">addr</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="n">align</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kr">inline</span> <span class="kt">uintptr_t</span> <span class="nf">align_up</span><span class="p">(</span><span class="kt">uintptr_t</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">uintptr_t</span> <span class="n">align</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="p">(</span><span class="n">addr</span> <span class="o">+</span> <span class="n">align</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">)</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="n">align</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_clean_range</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">start</span> <span class="o">=</span> <span class="nf">align_down</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span><span class="p">,</span> <span class="n">CACHE_LINE_SIZE</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">end</span>   <span class="o">=</span> <span class="nf">align_up</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span> <span class="o">+</span> <span class="n">len</span><span class="p">,</span> <span class="n">CACHE_LINE_SIZE</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_CleanDCache_by_Addr</span><span class="p">((</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">start</span><span class="p">,</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DSB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_invalidate_range</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">start</span> <span class="o">=</span> <span class="nf">align_down</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span><span class="p">,</span> <span class="n">CACHE_LINE_SIZE</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">end</span>   <span class="o">=</span> <span class="nf">align_up</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span> <span class="o">+</span> <span class="n">len</span><span class="p">,</span> <span class="n">CACHE_LINE_SIZE</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_InvalidateDCache_by_Addr</span><span class="p">((</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">start</span><span class="p">,</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DSB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
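</span></span></code></pre></div><p>The key detail here is that address and length are rounded to Cache Line boundaries. Many HAL examples pass the raw address and length; this appears to work, but it is risky on unaligned buffers, because the hardware maintains whole Cache Lines, not C objects. Rounding inside the helper only makes the affected range explicit; it does not make sharing safe. If a 20-byte DMA buffer shares a 32-byte Cache Line with a neighboring variable, Invalidate may throw away the neighbor's modification that exists only in the Cache, and Clean may write back data that was not meant to be committed. The buffer itself therefore has to start on a line boundary and occupy whole lines.</p>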
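<p>A simple guard against the shared-line case is to check, at the point where a buffer is registered with a driver, that it owns every Cache Line it touches. The sketch below assumes a 32-byte line; <code>CACHE_ROUND_UP</code>, <code>dma_buf_owns_its_lines</code>, and <code>padded_rx_buf_t</code> are hypothetical names used only for illustration.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u

/* Round a byte count up to a whole number of cache lines. */
#define CACHE_ROUND_UP(n) \
    (((size_t)(n) + (CACHE_LINE_SIZE - 1u)) & ~(size_t)(CACHE_LINE_SIZE - 1u))

/* True when the buffer starts on a line boundary and occupies whole lines,
 * so no unrelated variable can live in any line it touches. */
bool dma_buf_owns_its_lines(const void *addr, size_t len)
{
    uintptr_t a = (uintptr_t)addr;
    return len != 0u &&
           (a % CACHE_LINE_SIZE) == 0u &&
           (len % CACHE_LINE_SIZE) == 0u;
}

/* The 20-byte payload from the text, padded to a full, aligned line. */
typedef struct {
    _Alignas(32) uint8_t data[CACHE_ROUND_UP(20u)];
} padded_rx_buf_t;
```

<p>Declaring the 20-byte buffer as <code>padded_rx_buf_t</code> costs 12 bytes of padding, and in exchange Invalidate and Clean on it can no longer touch a neighbor. Passing the original length of 20 to <code>dma_buf_owns_its_lines()</code> fails the check, which is exactly the situation the paragraph above warns about.</p>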
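<h2 id="四链接脚本不要让-dma-buffer-和普通变量混住">4. Linker Script: Do Not Let DMA Buffers Share Space with Ordinary Variables</h2>
<p><code>__attribute__((aligned(32)))</code> alone solves part of the problem, but not systematically. Alignment fixes only the start address: it does not stop the linker from placing another variable in the unused tail of the buffer's last Cache Line, and it does not guarantee that the whole object lands in SRAM that DMA can reach. A more robust approach is to create dedicated link sections for DMA descriptors and DMA data.</p>
<p>With a GCC linker script, for example, you can carve two areas out of AXI SRAM: a non-cacheable descriptor area and a cacheable, hand-maintained data area. The actual addresses must follow the chip's reference manual; only the structure is shown here.</p>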
<pre tabindex="0"><code class="language-ld" data-lang="ld">MEMORY
{
  FLASH   (rx)  : ORIGIN = 0x08000000, LENGTH = 2048K
  DTCMRAM (xrw) : ORIGIN = 0x20000000, LENGTH = 128K
  AXIRAM  (xrw) : ORIGIN = 0x24000000, LENGTH = 512K
}

SECTIONS
{
  .dma_desc (NOLOAD) : ALIGN(32)
  {
    __dma_desc_start__ = .;
    *(.dma_desc*)
    . = ALIGN(32);
    __dma_desc_end__ = .;
  } &gt; AXIRAM

  .dma_buffer (NOLOAD) : ALIGN(32)
  {
    __dma_buffer_start__ = .;
    *(.dma_buffer*)
    . = ALIGN(32);
    __dma_buffer_end__ = .;
  } &gt; AXIRAM
}
</code></pre><p>In C code the declarations can look like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#define DMA_ALIGN __attribute__((aligned(32)))
</span></span></span><span class="line"><span class="cl"><span class="cp">#define DMA_DESC  __attribute__((section(&#34;.dma_desc&#34;), aligned(32)))
</span></span></span><span class="line"><span class="cl"><span class="cp">#define DMA_BUF   __attribute__((section(&#34;.dma_buffer&#34;), aligned(32)))
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="n">DMA_DESC</span> <span class="n">ETH_DMADescTypeDef</span> <span class="n">eth_rx_desc</span><span class="p">[</span><span class="n">ETH_RX_DESC_CNT</span><span class="p">];</span>
</span></span><span class="line"><span class="cl"><span class="n">DMA_DESC</span> <span class="n">ETH_DMADescTypeDef</span> <span class="n">eth_tx_desc</span><span class="p">[</span><span class="n">ETH_TX_DESC_CNT</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">DMA_BUF</span> <span class="kt">uint8_t</span> <span class="n">eth_rx_pool</span><span class="p">[</span><span class="n">ETH_RX_DESC_CNT</span><span class="p">][</span><span class="mi">1536</span><span class="p">];</span>
</span></span><span class="line"><span class="cl"><span class="n">DMA_BUF</span> <span class="kt">uint8_t</span> <span class="n">eth_tx_pool</span><span class="p">[</span><span class="n">ETH_TX_DESC_CNT</span><span class="p">][</span><span class="mi">1536</span><span class="p">];</span>
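</span></span></code></pre></div><p>This has several benefits. First, when reviewing the map file you can see at a glance where the DMA resources live. Second, the MPU can mark the whole <code>.dma_desc</code> section Non-cacheable. Third, data buffers no longer share a Cache Line with ordinary global variables; buffers inside the section still need sizes that are multiples of 32 bytes so they do not share lines with each other, which the 1536-byte pool entries here satisfy. Fourth, porting to another chip or board becomes cheaper.</p>
<h2 id="五mpu-初始化顺序先配置属性再打开-cache">5. MPU Initialization Order: Configure Attributes First, Then Enable the Cache</h2>
<p>The initialization order of the MPU and the Cache matters. The usual recommendation is to do it early in startup: disable the MPU, configure each Region, enable the MPU, and only then enable the I-Cache / D-Cache. If the attributes of a memory range are changed after the system is already running, the affected Cache Lines have to be cleaned and invalidated first, very carefully, otherwise the system is left in a state that is hard to reproduce.</p>
<p>A simplified initialization flow looks like this:</p>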
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">system_memory_attr_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_Region_InitTypeDef</span> <span class="n">MPU_InitStruct</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_Disable</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* AXI SRAM: default write-back cacheable */</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Enable</span>           <span class="o">=</span> <span class="n">MPU_REGION_ENABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Number</span>           <span class="o">=</span> <span class="n">MPU_REGION_NUMBER0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">BaseAddress</span>      <span class="o">=</span> <span class="mh">0x24000000</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Size</span>             <span class="o">=</span> <span class="n">MPU_REGION_SIZE_512KB</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">AccessPermission</span> <span class="o">=</span> <span class="n">MPU_REGION_FULL_ACCESS</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsBufferable</span>     <span class="o">=</span> <span class="n">MPU_ACCESS_BUFFERABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsCacheable</span>      <span class="o">=</span> <span class="n">MPU_ACCESS_CACHEABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsShareable</span>      <span class="o">=</span> <span class="n">MPU_ACCESS_NOT_SHAREABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">TypeExtField</span>     <span class="o">=</span> <span class="n">MPU_TEX_LEVEL1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">DisableExec</span>      <span class="o">=</span> <span class="n">MPU_INSTRUCTION_ACCESS_DISABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">SubRegionDisable</span> <span class="o">=</span> <span class="mh">0x00</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_ConfigRegion</span><span class="p">(</span><span class="o">&amp;</span><span class="n">MPU_InitStruct</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* DMA descriptor window: non-cacheable */</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Number</span>           <span class="o">=</span> <span class="n">MPU_REGION_NUMBER1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">BaseAddress</span>      <span class="o">=</span> <span class="mh">0x24070000</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Size</span>             <span class="o">=</span> <span class="n">MPU_REGION_SIZE_16KB</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsBufferable</span>     <span class="o">=</span> <span class="n">MPU_ACCESS_NOT_BUFFERABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsCacheable</span>      <span class="o">=</span> <span class="n">MPU_ACCESS_NOT_CACHEABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsShareable</span>      <span class="o">=</span> <span class="n">MPU_ACCESS_SHAREABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_ConfigRegion</span><span class="p">(</span><span class="o">&amp;</span><span class="n">MPU_InitStruct</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_Enable</span><span class="p">(</span><span class="n">MPU_PRIVILEGED_DEFAULT</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_EnableICache</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_EnableDCache</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
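</span></span></code></pre></div><p>This code cannot be copied verbatim into every project, because HAL macros, addresses, Region sizes, and default attributes differ between chips. In particular, the descriptor window at <code>0x24070000</code> is only illustrative: it has to match the address where the linker actually places <code>.dma_desc</code>, and that address has to be aligned to the Region size. What should be copied is the set of principles: peripheral registers keep the Device attribute; ordinary memory is Cacheable wherever possible; the descriptor area is Non-cacheable by preference; large data areas that need throughput are Cacheable plus maintenance functions; and every Region's base address and size satisfy the MPU rules.</p>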
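<p>Because the MPU window and the linker section are configured in two different files, a cheap consistency check at startup can catch the case where they drift apart. The helper below is a sketch with a hypothetical name; on the target it would be fed the window constants and the <code>__dma_desc_start__</code> / <code>__dma_desc_end__</code> symbols exported by the linker script shown earlier, which is an assumption about how the project is wired up.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the section [sec_start, sec_end) lies entirely inside the MPU
 * window [win_base, win_base + win_size). sec_end is one past the last byte. */
bool mpu_window_covers(uint32_t win_base, uint32_t win_size,
                       uint32_t sec_start, uint32_t sec_end)
{
    if (sec_end < sec_start || sec_start < win_base) {
        return false;
    }
    /* 64-bit sum avoids wrap-around for windows near the top of memory. */
    return (uint64_t)sec_end <= (uint64_t)win_base + win_size;
}
```

<p>A section placed at the start of AXI SRAM, for example, is not covered by a 16 KB window at <code>0x24070000</code>; the descriptors would then silently stay cacheable. Failing loudly at boot is usually preferable to debugging that later.</p>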
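<h2 id="六发送方向案例cpu-生产数据dma-消费数据">6. Transmit Case: the CPU Produces Data, DMA Consumes It</h2>
<p>The transmit direction is the easiest to explain. Suppose we send a block of samples with SPI DMA, or a network frame with ETH DMA. The data flow is: the CPU writes the buffer, DMA reads the buffer, the peripheral sends.</p>
<p>The correct sequence is:</p>
<ol>
<li>The CPU fills the buffer;</li>
<li>If the buffer lies in a Cacheable region, Clean it;</li>
<li>Set the DMA source address, length, and direction;</li>
<li>Issue a memory barrier where needed;</li>
<li>Start DMA;</li>
<li>After DMA completes, release the buffer or move on to the next round.</li>
</ol>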
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">eth_send_frame</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">uint16_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">len</span> <span class="o">&gt;</span> <span class="mi">1536</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* The CPU has already written the Ethernet frame into buf; write it back to memory before DMA starts. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_clean_range</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* Hypothetical interface: hand buf over to a DMA descriptor. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">eth_tx_desc_prepare</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* Barrier after the descriptor update so it completes before DMA is kicked. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DSB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="nf">eth_tx_kick_dma</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
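</span></span></code></pre></div><p>One point is easy to overlook here: if the descriptors themselves live in a Cacheable region, they have to be cleaned as well as the data buffer. Many Ethernet drivers make the descriptors Non-cacheable, so that DMA sees a descriptor update as soon as the CPU makes it and the driver logic stays simpler. The price is slightly slower CPU access to descriptors, but descriptors are small and the overhead is usually acceptable.</p>
<p>The same reasoning applies to SDMMC card writes, QSPI DMA writes to Flash, and SAI / I2S audio playback. Whenever the direction is "CPU writes first, DMA reads afterwards", the core action is Clean.</p>
<h2 id="七接收方向案例dma-生产数据cpu-消费数据">7. Receive Case: DMA Produces Data, the CPU Consumes It</h2>
<p>The receive direction is more error-prone, because the CPU may have touched the buffer without meaning to before DMA finishes. Zeroing during network-stack initialization, debug printing, and prefetching by the protocol stack all bring the corresponding Cache Lines into the D-Cache. If no Invalidate follows DMA completion, the CPU may still parse the old content.</p>
<p>The correct sequence is:</p>
<ol>
<li>Prepare an empty buffer;</li>
<li>If the CPU has written the buffer before, Clean or CleanInvalidate it as appropriate before starting DMA;</li>
<li>Start DMA and let the peripheral write to memory;</li>
<li>Wait for the completion interrupt or poll the completion flag;</li>
<li>Invalidate before the CPU reads;</li>
<li>Parse the buffer.</li>
</ol>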
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">camera_capture_one</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">frame</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="cm">/* If the CPU zeroed frame earlier and the region is cacheable, clean it to memory before starting. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_clean_range</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">dcmi_start_dma</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nf">wait_frame_done</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">dcmi_stop_dma</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* DMA has finished writing; drop the stale Cache Lines before the CPU reads. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_invalidate_range</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
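</span></span></code></pre></div><p>Some references recommend "Invalidate once before receiving and once more after receiving". This is not unreasonable: the Invalidate before reception prevents a dirty Cache Line from being written back on a later replacement and overwriting the new DMA data, and the Invalidate after reception makes sure the CPU sees the DMA result. In practice, though, it is better to keep the CPU away from the buffer while DMA is running and to make ownership explicit with a buffer state machine. If ownership is muddled, adding more maintenance calls only lowers the probability of failure; it does not cure it.</p>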
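<p>The dirty-line hazard behind the "Invalidate before receiving" advice can be reproduced with a few lines of host code. This is a toy sketch of a single cached word, not a model of the real Cortex-M7 replacement policy; every name in it (<code>toy_line_t</code>, <code>toy_evict</code>, <code>toy_receive</code>) is invented for the illustration.</p>

```c
#include <assert.h>   /* assert() is used by the usage checks */
#include <stdbool.h>
#include <stdint.h>

/* Toy model (assumption): one RAM word and one cache line shadowing it. */
typedef struct {
    uint32_t ram;    /* what DMA reads and writes */
    uint32_t line;   /* cached copy seen by the CPU */
    bool valid;
    bool dirty;
} toy_line_t;

/* CPU store into a write-back line: RAM is not updated yet. */
void toy_cpu_store(toy_line_t *l, uint32_t v)
{
    l->line = v;
    l->valid = true;
    l->dirty = true;
}

/* DMA store goes straight to RAM. */
void toy_dma_store(toy_line_t *l, uint32_t v)
{
    l->ram = v;
}

/* Line replacement: a dirty line is written back before it is dropped. */
void toy_evict(toy_line_t *l)
{
    if (l->valid && l->dirty) {
        l->ram = l->line;
    }
    l->valid = false;
    l->dirty = false;
}

/* Invalidate: drop the line and never write it back. */
void toy_invalidate(toy_line_t *l)
{
    l->valid = false;
    l->dirty = false;
}

/* CPU load: refill from RAM on a miss. */
uint32_t toy_cpu_load(toy_line_t *l)
{
    if (!l->valid) {
        l->line = l->ram;
        l->valid = true;
        l->dirty = false;
    }
    return l->line;
}

/* One receive cycle; returns what the CPU finally reads. */
uint32_t toy_receive(bool pre_invalidate)
{
    toy_line_t l = {0};
    toy_cpu_store(&l, 0x11111111u);  /* CPU filled the buffer earlier */
    if (pre_invalidate) {
        toy_invalidate(&l);          /* maintenance before DMA starts */
    }
    toy_dma_store(&l, 0xD0D0D0D0u);  /* DMA delivers fresh data */
    toy_evict(&l);                   /* unrelated traffic replaces the line */
    toy_invalidate(&l);              /* maintenance after DMA completes */
    return toy_cpu_load(&l);
}
```

<p>In this toy, <code>toy_receive(false)</code> returns the old CPU value because the eviction wrote the dirty line over the DMA data, and the post-receive Invalidate cannot bring it back; <code>toy_receive(true)</code> returns the DMA value. A Clean before reception, as in <code>camera_capture_one()</code>, removes the dirty state just as well.</p>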
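<h2 id="八双缓冲与环形队列用所有权模型替代到处加维护函数">8. Double Buffering and Ring Queues: Replace "Maintenance Calls Everywhere" with an Ownership Model</h2>
<p>High-throughput peripherals rarely use a single buffer. Cameras have frame buffers, audio has ping-pong buffers, networking has RX / TX descriptor rings, and SD cards have block caches. What matters most here is not calling a few more Cache APIs but establishing a clear buffer ownership model.</p>
<p>Each buffer's state can be defined as:</p>
<ul>
<li><code>FREE</code>: the CPU may fill it or hand it to a peripheral;</li>
<li><code>DMA_OWNED</code>: DMA is reading or writing it, the CPU must not touch it;</li>
<li><code>CPU_OWNED</code>: DMA has finished, the CPU may parse or modify it;</li>
<li><code>QUEUED</code>: already placed in a protocol-stack or application queue, waiting to be consumed.</li>
</ul>
<p>On transmit, Clean runs at the moment a buffer changes from <code>CPU_OWNED</code> to <code>DMA_OWNED</code>; on receive, Invalidate runs at the moment a buffer changes from <code>DMA_OWNED</code> to <code>CPU_OWNED</code>. The maintenance actions are tied to state transitions instead of being scattered through business code.</p>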
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">enum</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">BUF_FREE</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">BUF_CPU_OWNED</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">BUF_DMA_OWNED</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">BUF_QUEUED</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="kt">buf_state_t</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">addr</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">volatile</span> <span class="kt">buf_state_t</span> <span class="n">state</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="kt">dma_buf_t</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">rx_dma_complete_isr</span><span class="p">(</span><span class="kt">dma_buf_t</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">actual_len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">=</span> <span class="n">actual_len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_invalidate_range</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">addr</span><span class="p">,</span> <span class="n">actual_len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="n">b</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">=</span> <span class="n">BUF_CPU_OWNED</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">tx_submit</span><span class="p">(</span><span class="kt">dma_buf_t</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_clean_range</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">addr</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="n">b</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">=</span> <span class="n">BUF_DMA_OWNED</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_submit_to_hw</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">addr</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
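</span></span></code></pre></div><p>This approach has a further benefit: if you later make some buffers Non-cacheable, you only need to change the implementation of <code>dma_clean_range()</code> / <code>dma_invalidate_range()</code> or the memory attributes, rather than deleting calls one by one in the application layer.</p>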
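<h2 id="九什么时候应该把-dma-区域设为-non-cacheable">9. When Should a DMA Region Be Non-cacheable?</h2>
<p>Since manual cache maintenance is so error-prone, why not simply make every DMA buffer Non-cacheable? You can, but whether you should depends on data size and access pattern.</p>
<p>Objects that suit Non-cacheable memory usually fall into three groups. The first is small objects such as descriptors, status words, and doorbell-register mirrors: the CPU and DMA access them alternately and often, the data volume is tiny, and determinism matters more than cache hits. The second is small-packet buffers for low-speed peripherals, for example a UART DMA channel receiving commands of a few dozen bytes, where the cache brings little benefit. The third is the debugging phase: to rule out coherency problems quickly, you can temporarily mark the shared region Non-cacheable and confirm whether the cache is the cause.</p>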
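<p>Objects that do not suit Non-cacheable memory are just as typical: full camera frames, LCD framebuffers, neural-network input and output tensors, high-throughput network buffer pools, and file-system block caches. If the CPU scans, copies, color-converts, checksums, or parses this data heavily, turning the cache off entirely usually drops performance to an unacceptable level. For example, an 800×480 RGB565 framebuffer occupies 768,000 bytes (750 KiB); if every access the CPU makes while drawing the UI goes straight to external SDRAM, both refresh rate and responsiveness suffer badly.</p>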
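<p>The framebuffer figure above is easy to reproduce. The sketch below is only illustrative arithmetic: the helper names are made up for this example, and the 60 fps refresh rate used in the note after it is an assumption, not a number from any datasheet.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for one frame: width * height * bytes per pixel. */
size_t frame_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    return (size_t)width * height * bytes_per_pixel;
}

/* Memory traffic generated by touching every byte of a frame once
 * per refresh (hypothetical helper, fps is supplied by the caller). */
uint64_t scan_bytes_per_second(size_t one_frame_bytes, uint32_t fps)
{
    return (uint64_t)one_frame_bytes * fps;
}
```

<p>For 800×480 RGB565 (2 bytes per pixel), <code>frame_bytes(800, 480, 2)</code> gives 768,000 bytes. At an assumed 60 fps, a single full pass over the frame per refresh already means about 46 MB/s of traffic, which is the load an uncached external SDRAM would have to absorb directly.</p>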
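<p>A more practical strategy is therefore: <strong>descriptors Non-cacheable, bulk data Cacheable; maintenance operations wrapped in one place; every buffer aligned to 32 bytes; the CPU and DMA never own the same buffer at the same time.</strong></p>
<h2 id="十性能调优一致性正确只是第一步">10. Performance Tuning: Correct Coherency Is Only the First Step</h2>
<p>Many projects fix their cache bugs only to find that performance is still disappointing. Cache coherency answers whether the data is correct; throughput also depends on bus arbitration, memory banks, burst length, and access pattern.</p>
<h3 id="1-减少无意义的整帧-clean--invalidate">1. Avoid pointless whole-frame Clean / Invalidate</h3>
<p>Cache maintenance has a cost of its own. Invalidating an entire framebuffer of several hundred KB takes a noticeable amount of time. If the DMA actually wrote only an ROI, maintain just that range; if a network packet is actually 300 bytes long, do not maintain the whole 1536-byte pool buffer.</p>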
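<p>To make the 300-byte versus 1536-byte comparison concrete, the following sketch counts how many 32-byte cache lines a range-based maintenance call has to walk. The function name is hypothetical, and the addresses in the note are illustrative only.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32U  /* Cortex-M7 D-Cache line size */

/* Number of cache lines overlapped by the byte range [addr, addr + len). */
size_t cache_lines_touched(uintptr_t addr, size_t len)
{
    if (len == 0) {
        return 0;
    }
    uintptr_t first = addr / CACHE_LINE_BYTES;
    uintptr_t last  = (addr + len - 1U) / CACHE_LINE_BYTES;
    return (size_t)(last - first + 1U);
}
```

<p>For a line-aligned buffer, a 300-byte packet touches 10 lines while the full 1536-byte pool entry touches 48, so maintaining only the actual length does roughly a fifth of the work. An unaligned 2-byte range that straddles a boundary still touches 2 lines, which is one more reason to keep buffers line-aligned.</p>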
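<h3 id="2-避免-cpu-在-dma-期间扫描同一区域">2. Keep the CPU from scanning a region while DMA is using it</h3>
<p>Some code reads the first few bytes of a buffer while DMA reception is still in progress, to show progress or to debug. Doing so complicates the cache state and can also cause bus contention. A better approach is to report progress through a separate state variable or the DMA interrupt flags, and never peek into a data region that the DMA currently owns.</p>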
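<p>One possible shape for such a separate state variable is sketched below. All names are hypothetical; the assumption is that the DMA driver offers half-transfer and transfer-complete callbacks, which most DMA controllers with circular or double-buffer modes do, and that only those callbacks write the counter.</p>

```c
#include <stdint.h>

typedef struct {
    uint32_t          total_bytes;  /* size of the transfer in flight */
    volatile uint32_t done_bytes;   /* written only from DMA interrupt context */
} dma_progress_t;

void dma_progress_start(dma_progress_t *p, uint32_t total_bytes)
{
    p->total_bytes = total_bytes;
    p->done_bytes  = 0U;
}

/* Call from the half-transfer interrupt callback. */
void dma_progress_on_half(dma_progress_t *p)
{
    p->done_bytes = p->total_bytes / 2U;
}

/* Call from the transfer-complete interrupt callback. */
void dma_progress_on_complete(dma_progress_t *p)
{
    p->done_bytes = p->total_bytes;
}

/* Safe to call from any task: reads the counter, never the DMA buffer. */
uint32_t dma_progress_percent(const dma_progress_t *p)
{
    if (p->total_bytes == 0U) {
        return 0U;
    }
    return (uint32_t)(((uint64_t)p->done_bytes * 100U) / p->total_bytes);
}
```

<p>The progress structure itself should live in ordinary CPU-owned memory, not inside the DMA buffer, so reading it never touches a cache line that the DMA may be writing.</p>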
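<h3 id="3-对齐不仅是-cache-line也是总线突发">3. Alignment is about bus bursts as well as cache lines</h3>
<p>32-byte alignment solves the cache-line problem, but some DMA controllers and buses work better on 64-byte or 128-byte boundaries. For large contiguous data such as images and audio, try to make the row stride, block size, and ring-buffer node size match the burst length recommended for the hardware, or a multiple of it, to reduce cross-bank and unaligned transfers.</p>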
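<p>As a small illustration of the stride advice, the helper below pads a row to a burst boundary. The name and the 64-byte burst size used in the note are assumptions for the example; take the real value from the reference manual of the DMA or display controller in use.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Round a row of pixels up to a multiple of burst_bytes.
 * burst_bytes must be a non-zero power of two (e.g. 32, 64, 128). */
uint32_t aligned_stride(uint32_t width_px, uint32_t bytes_per_pixel,
                        uint32_t burst_bytes)
{
    assert(burst_bytes != 0U && (burst_bytes & (burst_bytes - 1U)) == 0U);
    uint32_t raw = width_px * bytes_per_pixel;
    return (raw + burst_bytes - 1U) & ~(burst_bytes - 1U);
}
```

<p>An 800-pixel RGB565 row is 1600 bytes, already a multiple of 64, so no padding is needed. A 481-pixel row is 962 bytes and gets padded to 1024, which costs 62 bytes per row but keeps every row start on a burst boundary.</p>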
<h3 id="4-用-dwt-或-etm-做真实测量">4. Measure for real with DWT or ETM</h3>
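<p>Do not judge an optimization by feel alone. Cortex-M3/M4/M7 cores provide the DWT cycle counter, which makes it easy to measure how long Clean, Invalidate, memcpy, and protocol parsing take. If <code>CYCCNT</code> stays at zero on your part, check whether the DWT lock access register (<code>DWT-&gt;LAR</code>) has to be unlocked first; some Cortex-M7 devices require it.</p>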
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="nf">dwt_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">CoreDebug</span><span class="o">-&gt;</span><span class="n">DEMCR</span> <span class="o">|=</span> <span class="n">CoreDebug_DEMCR_TRCENA_Msk</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">DWT</span><span class="o">-&gt;</span><span class="n">CYCCNT</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">DWT</span><span class="o">-&gt;</span><span class="n">CTRL</span> <span class="o">|=</span> <span class="n">DWT_CTRL_CYCCNTENA_Msk</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">uint32_t</span> <span class="nf">measure_invalidate_cycles</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">start</span> <span class="o">=</span> <span class="n">DWT</span><span class="o">-&gt;</span><span class="n">CYCCNT</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_invalidate_range</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">DWT</span><span class="o">-&gt;</span><span class="n">CYCCNT</span> <span class="o">-</span> <span class="n">start</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
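</span></span></code></pre></div><p>Printing the measurements as a table is usually far more effective than guessing. You will see that maintaining 64 bytes, 1500 bytes, 64KB, and a full image frame cost completely different amounts, and that some global maintenance calls added "just to be safe" are actually very expensive.</p>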
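<p>A minimal sketch of the bookkeeping behind such a table follows. The helper names are hypothetical, and the 480 MHz core clock mentioned in the note is an assumption; pass in whatever your clock tree actually produces. The code is plain C with no hardware access, so the cycle counts would come from a measurement function such as the one shown earlier.</p>

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Elapsed cycles between two CYCCNT samples. Unsigned subtraction stays
 * correct across a single 32-bit wrap of the counter. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to nanoseconds for a core clock given in Hz.
 * The result is truncated to 32 bits, enough for intervals up to ~4 s. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000ULL) / core_hz);
}

/* Print one row of the cost table: range length, cycles, nanoseconds. */
void print_cost_row(size_t len, uint32_t cycles, uint32_t core_hz)
{
    printf("%8zu B  %10" PRIu32 " cycles  %10" PRIu32 " ns\n",
           len, cycles, cycles_to_ns(cycles, core_hz));
}
```

<p>At an assumed 480 MHz, 480 cycles correspond to 1000 ns. Calling <code>print_cost_row()</code> for 64, 1500, 65536 bytes and one full frame gives the table described above.</p>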
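<h2 id="十一常见故障排查清单">11. Troubleshooting Checklist for Common Failures</h2>
<p>When you hit a DMA + cache problem, working through the checks below in order is much more efficient.</p>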
<ol>
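<li><strong>Confirm the DMA can reach the memory region</strong>: many DMA controllers cannot access DTCM. Check the bus matrix and the reference manual first; do not assume access just because the address is in SRAM.</li>
<li><strong>Confirm buffer address and length alignment</strong>: the address must be at least 32-byte aligned and the length rounded up; ideally the linker section boundaries are aligned as well.</li>
<li><strong>Confirm the maintenance operation matches the direction</strong>: CPU writes and DMA reads need Clean; DMA writes and CPU reads need Invalidate; descriptors count too.</li>
<li><strong>Confirm the maintenance timing</strong>: Clean must happen before the DMA starts; Invalidate must happen after the DMA completes and before the CPU reads.</li>
<li><strong>Confirm there is no concurrent access</strong>: while the DMA owns a buffer, the CPU must not read or write the same range.</li>
<li><strong>Confirm the MPU attributes are what you expect</strong>: read back the MPU configuration or print the region table in the boot log; do not simply trust CubeMX or the default startup file.</li>
<li><strong>Confirm barriers are still present at your optimization level</strong>: add <code>__DMB()</code> / <code>__DSB()</code> at critical state transitions, especially descriptor handoff and interrupt handling paths.</li>
<li><strong>Temporarily disable the D-Cache as an A/B test</strong>: if the problem disappears, you can largely narrow the search to coherency maintenance or MPU attributes.</li>
<li><strong>Check the map file</strong>: confirm the buffer was not linked into the wrong region and that a typo has not silently disabled the section attribute.</li>
<li><strong>Check for hidden accesses in library code</strong>: protocol stacks, file systems, and graphics libraries may read or write a buffer early, so the ownership rules must extend to the library interface boundary.</li>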
</ol>
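<h2 id="十二一个更完整的工程封装思路">12. A More Complete Engineering Wrapper</h2>
<p>In a real project, application code should not call <code>SCB_CleanDCache_by_Addr()</code> directly. Instead, build a thin <code>dma_mem</code> module that handles alignment, attributes, statistics, and debug assertions in one place.</p>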
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cm">/* dma_mem.h */</span>
</span></span><span class="line"><span class="cl"><span class="cp">#pragma once
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stddef.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdint.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="cp">#define DMA_CACHE_LINE 32U
</span></span></span><span class="line"><span class="cl"><span class="cp">#define DMA_SECTION __attribute__((section(&#34;.dma_buffer&#34;), aligned(DMA_CACHE_LINE)))
</span></span></span><span class="line"><span class="cl"><span class="cp">#define DMA_DESC_SECTION __attribute__((section(&#34;.dma_desc&#34;), aligned(DMA_CACHE_LINE)))
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_init</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_clean</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_invalidate</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_check_range</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">);</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cm">/* dma_mem.c */</span>
</span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;dma_mem.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;cmsis_gcc.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;core_cm7.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">uintptr_t</span> <span class="nf">down</span><span class="p">(</span><span class="kt">uintptr_t</span> <span class="n">v</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">v</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">DMA_CACHE_LINE</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">uintptr_t</span> <span class="nf">up</span><span class="p">(</span><span class="kt">uintptr_t</span> <span class="n">v</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="p">(</span><span class="n">v</span> <span class="o">+</span> <span class="n">DMA_CACHE_LINE</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">)</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">DMA_CACHE_LINE</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_clean</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">s</span> <span class="o">=</span> <span class="nf">down</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">e</span> <span class="o">=</span> <span class="nf">up</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span> <span class="o">+</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_CleanDCache_by_Addr</span><span class="p">((</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">,</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">e</span> <span class="o">-</span> <span class="n">s</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DSB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_invalidate</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">s</span> <span class="o">=</span> <span class="nf">down</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">e</span> <span class="o">=</span> <span class="nf">up</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span> <span class="o">+</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_InvalidateDCache_by_Addr</span><span class="p">((</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">,</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">e</span> <span class="o">-</span> <span class="n">s</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DSB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
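</span></span></code></pre></div><p>In Debug builds, <code>dma_mem_check_range()</code> can check whether the address falls in a memory segment that DMA is allowed to access, whether the length runs past the buffer boundary, and whether the address is 32-byte aligned. Problems then surface during development instead of turning into intermittent crashes at the customer site.</p>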
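<p>The three checks just described could be sketched as a pure classification function that a Debug-only <code>dma_mem_check_range()</code> asserts on. Everything here is an assumption made for illustration: the status enum, the function name, and the region table. The single entry uses 0x24000000 with 512 KB, which matches AXI SRAM on some STM32H7 parts; replace it with the DMA-capable regions of your own memory map.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_CACHE_LINE 32U

typedef enum {
    DMA_RANGE_OK,
    DMA_RANGE_EMPTY,       /* len == 0 */
    DMA_RANGE_MISALIGNED,  /* start address not on a cache-line boundary */
    DMA_RANGE_OUTSIDE      /* not fully inside a DMA-capable region */
} dma_range_status_t;

typedef struct {
    uintptr_t base;
    size_t    size;
} dma_region_t;

/* Illustrative table only: adjust to the target's memory map. */
static const dma_region_t k_dma_regions[] = {
    { 0x24000000u, 512u * 1024u },
};

dma_range_status_t dma_mem_classify_range(uintptr_t addr, size_t len)
{
    if (len == 0) {
        return DMA_RANGE_EMPTY;
    }
    if ((addr % DMA_CACHE_LINE) != 0U) {
        return DMA_RANGE_MISALIGNED;
    }
    for (size_t i = 0; i < sizeof k_dma_regions / sizeof k_dma_regions[0]; i++) {
        const dma_region_t *r = &k_dma_regions[i];
        /* Written so that addr + len is never computed and cannot overflow. */
        if (addr >= r->base && len <= r->size &&
            (addr - r->base) <= (r->size - len)) {
            return DMA_RANGE_OK;
        }
    }
    return DMA_RANGE_OUTSIDE;
}
```

<p>A Debug build might then implement <code>dma_mem_check_range()</code> as an assertion that the result is <code>DMA_RANGE_OK</code>, and compile the check out in Release. With the table above, a buffer at 0x20000000 (DTCM on STM32H7) is reported as <code>DMA_RANGE_OUTSIDE</code>, which catches the first item of the troubleshooting checklist automatically.</p>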
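<h2 id="十三把经验落到团队规范里">13. Turning Experience into Team Rules</h2>
<p>Cache coherency should not depend on one driver engineer's personal habits; it belongs in the team's coding standard. At a minimum, write down the following rules:</p>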
<ul>
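<li>Every DMA buffer must be declared through the shared macros; a casual <code>static uint8_t buf[1024]</code> is not allowed;</li>
<li>Every DMA descriptor must be placed in <code>.dma_desc</code>;</li>
<li>Every DMA data pool must be placed in <code>.dma_buffer</code> or be explicitly marked Non-cacheable;</li>
<li>Drivers must perform maintenance through the <code>dma_mem_*</code> interface before and after submitting a buffer;</li>
<li>While the DMA owns a buffer, the CPU must not access it, debug prints included;</li>
<li>Every new peripheral driver must state its memory region, direction, maintenance operations, and alignment strategy in review;</li>
<li>Every change to the linker script, MPU initialization, or external RAM configuration must be followed by a DMA regression run.</li>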
</ul>
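<p>The regression test can be very plain: build buffers with different lengths, offsets, and fill values; let the DMA do a memory-to-memory copy or a peripheral loopback; have the CPU compute a CRC before and after; and cover boundary lengths such as 1, 31, 32, 33, 1500, and 4096 bytes. Many hidden problems show up at lengths like 31 and 33 bytes that are not a whole number of cache lines.</p>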
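<h2 id="十四总结高性能-mcu-要用-soc-思维来写">14. Summary: Write High-Performance MCU Code with an SoC Mindset</h2>
<p>A high-performance MCU such as the Cortex-M7 still feels like a microcontroller to develop for, but its memory system is already close to that of a small SoC: cache, MPU, multiple buses, multiple SRAM domains, external RAM, and several DMA masters all coexist. If you keep writing code with the old "all addresses are equivalent" model, then the closer the project gets to mass production, the more likely you are to hit intermittent problems that are hard to reproduce and hard to locate.</p>
<p>The keys to a stable system are not complicated: first draw out the access paths between the CPU, the DMA, and memory; fix the memory attributes with the MPU; isolate descriptors and data pools in the linker script; run Clean / Invalidate correctly for each direction; bind the maintenance operations to the points where buffer ownership changes; and finally verify performance and correctness with DWT, CRC, and boundary-length tests.</p>
<p>In practice, the most reliable solution is usually neither "turn the cache off globally" nor "sprinkle an extra Invalidate everywhere", but managing memory as an engineering discipline. Descriptors are small and critical, so they can be Non-cacheable; bulk data needs throughput, so keep it Cacheable, but under strict alignment, ownership, and maintenance wrappers. This preserves the performance of the Cortex-M7 while keeping high-bandwidth peripherals such as ETH, SDMMC, DCMI, LTDC, and DMA2D stable.</p>
<p>If you are building an STM32H7 camera gateway, an industrial Ethernet node, a UI controller with a display, an audio capture device, or an edge AI inference board, put the rules from this article into your project template early. The sooner you establish discipline around memory regions and DMA buffers, the less time you will later spend chasing "occasional dirty data", and the more maintainable the system will be.</p>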
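<h2 id="十五附录一套最小回归用例">15. Appendix: A Minimal Regression Suite</h2>
<p>To close, here is a simple but very effective regression approach. Prepare a 4KB DMA test area and start tests at offsets of 0, 1, 15, 31, 32, and 33 bytes, covering lengths of 1, 16, 31, 32, 33, 127, 256, and 1500 bytes. Before each test the CPU writes an incrementing pattern, cleans it, and lets the DMA copy it to another area; once the DMA completes, invalidate the destination and have the CPU compute a CRC. Then reverse the direction: let the DMA write into the source area and have the CPU verify it after Invalidate. The test needs no complex peripheral, many chips can run it with memory-to-memory DMA, and it works well as the first check in board bring-up.</p>
<p>If this boundary suite runs continuously for several hours with the D-Cache enabled, the highest optimization level, RTOS task switching, and interrupt load all present at once, then the memory attributes, linker script, alignment wrappers, and maintenance timing are broadly trustworthy. Conversely, if any 31-byte, 33-byte, or non-zero-offset case fails, do not rush to blame the protocol stack; go back and check three things first: cache lines, ownership, and MPU regions.</p>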
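<p>The loop described in this appendix could be organized roughly as follows. This is a host-runnable sketch under loud assumptions: the three <code>hook_*</code> functions are stand-ins invented for the example (no-op cache maintenance and a <code>memcpy</code> in place of a memory-to-memory DMA transfer), and on target they would be replaced by the real <code>dma_mem_*</code> wrappers and DMA driver. The guard-byte check is an addition that catches corruption of neighbouring data.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TEST_AREA_BYTES 4096U
#define GUARD_FILL      0xA5U

/* Host stand-ins (assumptions): swap for real wrappers and DMA on target. */
static void hook_clean(const void *addr, size_t len) { (void)addr; (void)len; }
static void hook_invalidate(void *addr, size_t len)  { (void)addr; (void)len; }
static void hook_dma_copy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), small rather than fast. */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

static uint8_t src_area[TEST_AREA_BYTES];
static uint8_t dst_area[TEST_AREA_BYTES];

/* Runs every offset/length combination and returns the failure count. */
unsigned run_dma_regression(void)
{
    static const size_t offsets[] = { 0, 1, 15, 31, 32, 33 };
    static const size_t lengths[] = { 1, 16, 31, 32, 33, 127, 256, 1500 };
    unsigned failures = 0;

    for (size_t oi = 0; oi < sizeof offsets / sizeof offsets[0]; oi++) {
        for (size_t li = 0; li < sizeof lengths / sizeof lengths[0]; li++) {
            size_t off = offsets[oi];
            size_t len = lengths[li];

            /* Fill the destination with a guard pattern and write it back,
             * so no dirty line can later be evicted over the DMA data. */
            memset(dst_area, GUARD_FILL, sizeof dst_area);
            hook_clean(dst_area, sizeof dst_area);

            /* CPU writes an incrementing pattern, then cleans the source. */
            for (size_t i = 0; i < len; i++) {
                src_area[off + i] = (uint8_t)(i + off + len);
            }
            uint32_t expected = crc32_bitwise(&src_area[off], len);
            hook_clean(&src_area[off], len);

            hook_dma_copy(&dst_area[off], &src_area[off], len);
            hook_invalidate(&dst_area[off], len);

            if (crc32_bitwise(&dst_area[off], len) != expected) {
                failures++;
            }
            /* Bytes just outside the range must still hold the guard value. */
            if (off > 0 && dst_area[off - 1U] != GUARD_FILL) {
                failures++;
            }
            if (dst_area[off + len] != GUARD_FILL) {
                failures++;
            }
        }
    }
    return failures;
}
```

<p>On a host the function trivially returns zero; its value comes from running the same loop on target with real cache maintenance and DMA, where a non-zero count points straight at an offset and length combination worth inspecting. The reverse direction described above can reuse the same loop with source and destination swapped.</p>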
]]></content:encoded>
    </item>
  </channel>
</rss>
