<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>STM32H7 on Tech Snippets - 嵌入式技术笔记</title>
    <link>https://tech-snippets.xyz/tags/stm32h7/</link>
    <description>Recent content in STM32H7 on Tech Snippets - 嵌入式技术笔记</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Mon, 08 Jun 2026 19:00:00 +0800</lastBuildDate>
    <atom:link href="https://tech-snippets.xyz/tags/stm32h7/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Cortex-M Cache, MPU, and DMA Coherency in Practice: Making High-Performance MCUs Like the STM32H7 Run Stable and Fast</title>
      <link>https://tech-snippets.xyz/posts/arm-cortex-m-cache-mpu-dma-coherency-guide/</link>
      <pubDate>Mon, 08 Jun 2026 19:00:00 +0800</pubDate>
      <guid>https://tech-snippets.xyz/posts/arm-cortex-m-cache-mpu-dma-coherency-guide/</guid>
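      <description>Preface: the most insidious pitfall on high-performance MCUs is not a lack of compute, but "inconsistent" data. For many people, the first move from Cortex-M3 / Cortex-M4 to Cortex-M7 feels very direct: the clock is faster, the FPU is stronger, on-chip SRAM is larger, and peripheral bandwidth has gone up. On parts such as the STM32H7, NXP i.MX RT, and some Chinese-made high-performance MCUs, the system now contains I-Cache, D-Cache, AXI SRAM, a multi-layer bus matrix, MDMA, ETH, SDMMC, DCMI, LTDC, and other blocks that rarely needed careful handling on small MCUs. The code is still C, the peripherals still use DMA, and the debugger can still single-step, but once a project gets into image capture, Ethernet, file systems, audio streaming, or display refresh, the problems become very strange:
DMA has clearly finished writing the buffer, yet the CPU still reads old data; the CPU has clearly filled in the outgoing packet, yet the Ethernet DMA sends the previous frame; turning off the D-Cache makes the system stable, but throughput drops sharply; after adding a call to SCB_CleanDCache_by_Addr(), it sometimes works and sometimes fails; the same code works in the Debug build but fails in the Release build or at a different optimization level; when the buffer length is not a multiple of 32 bytes, neighboring variables get "mysteriously" corrupted. The root cause of these symptoms is usually neither a buggy peripheral driver nor compiler "black magic": the CPU, cache, MPU, and DMA disagree about the contents of the same memory. The Cortex-M7 D-Cache speeds up CPU accesses, but a DMA controller usually does not go through the D-Cache; it reads and writes SRAM or external RAM directly. So for one and the same address, the CPU may see new data in a cache line while DMA sees old data in memory; conversely, DMA may already have written new data to memory while the CPU still hits a stale cache line.</description>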
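      <content:encoded><![CDATA[<h2 id="前言高性能-mcu-最隐蔽的坑不是算力不够而是数据不一致">Preface: The Most Insidious Pitfall on High-Performance MCUs Is Not a Lack of Compute, but "Inconsistent" Data</h2>
<p>For many people, the first move from Cortex-M3 / Cortex-M4 to Cortex-M7 feels very direct: the clock is faster, the FPU is stronger, on-chip SRAM is larger, and peripheral bandwidth has gone up. On parts such as the STM32H7, NXP i.MX RT, and some Chinese-made high-performance MCUs, the system now contains I-Cache, D-Cache, AXI SRAM, a multi-layer bus matrix, MDMA, ETH, SDMMC, DCMI, LTDC, and other blocks that rarely needed careful handling on small MCUs. The code is still C, the peripherals still use DMA, and the debugger can still single-step, but once a project gets into image capture, Ethernet, file systems, audio streaming, or display refresh, the problems become very strange:</p>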
<ul>
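<li>DMA has clearly finished writing the buffer, yet the CPU still reads old data;</li>
<li>The CPU has clearly filled in the outgoing packet, yet the Ethernet DMA sends the previous frame;</li>
<li>Turning off the D-Cache makes the system stable, but throughput drops sharply;</li>
<li>After adding a call to <code>SCB_CleanDCache_by_Addr()</code>, it sometimes works and sometimes fails;</li>
<li>The same code works in the Debug build but fails in the Release build or at a different optimization level;</li>
<li>When the buffer length is not a multiple of 32 bytes, neighboring variables get "mysteriously" corrupted.</li>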
</ul>
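<p>The root cause of these symptoms is usually neither a buggy peripheral driver nor compiler "black magic": the CPU, cache, MPU, and DMA disagree about the contents of the same memory. The Cortex-M7 D-Cache speeds up CPU accesses, but a DMA controller usually does not go through the D-Cache; it reads and writes SRAM or external RAM directly. So for one and the same address, the CPU may see new data in a cache line while DMA sees old data in memory; conversely, DMA may already have written new data to memory while the CPU still hits a stale cache line.</p>
<p>This article does not translate the register manual bit by bit. Instead, it answers three questions from an engineering point of view: first, why cache, MPU, and DMA affect one another; second, how to design a maintainable memory-region and buffer strategy; third, how to write code that runs stably for long periods in networking, camera, SD card, and display refresh scenarios. The examples lean toward the STM32H7 / Cortex-M7, but the method applies equally to other high-performance MCUs with a D-Cache, an MPU, and DMA.</p>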
<p><img alt="Coherency relationships between Cortex-M cache, MPU, and DMA" loading="lazy" src="/images/arm-cortex-m-cache-mpu-dma-coherency.svg"></p>
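<h2 id="一先建立一个基本模型cpu-看-cachedma-看内存">1. Start with a Basic Model: the CPU Sees the Cache, DMA Sees Memory</h2>
<p>In traditional Cortex-M0 / M3 / M4 projects, we tend to bind "address" and "data" together in a simple way: a pointer points into SRAM, whatever the CPU writes is what DMA reads, and whatever DMA writes is what the CPU sees on its next read. On a system without a D-Cache this model largely holds; at most you have to think about <code>volatile</code>, interrupt races, and bus bandwidth.</p>
<p>On the Cortex-M7 the model has to change: when the CPU accesses an address, it may access the D-Cache first; when DMA accesses an address, it usually goes straight to on-chip SRAM such as AXI SRAM, or to external SDRAM. The Cortex-M7 D-Cache line size is 32 bytes. When the CPU writes a variable, it may only mark the corresponding cache line as dirty without actually writing it back to memory. When the CPU reads an address and hits the cache, it may never go to memory to fetch the new content DMA has just written.</p>
<p>This is the cache coherency problem. Desktop CPUs and high-end SoCs often have hardware coherency, where multiple cores, DMA masters, and peripherals stay consistent through protocols such as ACE and CHI. On many MCUs, however, DMA does not take part in any D-Cache coherency protocol, so the maintenance burden falls on software.</p>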
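<p>To make the model concrete, here is a minimal host-side sketch, not MCU code: a toy structure holding one 32-byte line of "RAM" and one write-back cache line. All names (<code>toy_line_t</code>, <code>cpu_write</code>, <code>dma_read</code>, and so on) are hypothetical and exist only for this illustration; a real D-Cache has many lines, tags, and a replacement policy.</p>

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32U

/* Toy model: one line of RAM plus one write-back cache line covering it. */
typedef struct {
    uint8_t ram[LINE_SIZE];    /* what DMA reads and writes */
    uint8_t cache[LINE_SIZE];  /* what the CPU sees while the line is valid */
    bool valid;
    bool dirty;
} toy_line_t;

/* CPU store: allocate the line if needed, update only the cached copy. */
static void cpu_write(toy_line_t *l, unsigned off, uint8_t v)
{
    if (!l->valid) {
        memcpy(l->cache, l->ram, LINE_SIZE);
        l->valid = true;
    }
    l->cache[off] = v;
    l->dirty = true;           /* RAM is now stale */
}

/* CPU load: a hit never looks at RAM. */
static uint8_t cpu_read(toy_line_t *l, unsigned off)
{
    if (!l->valid) {
        memcpy(l->cache, l->ram, LINE_SIZE);
        l->valid = true;
        l->dirty = false;
    }
    return l->cache[off];
}

/* DMA bypasses the cache entirely. */
static void dma_write(toy_line_t *l, unsigned off, uint8_t v) { l->ram[off] = v; }
static uint8_t dma_read(const toy_line_t *l, unsigned off) { return l->ram[off]; }

/* Clean: write dirty data back to RAM, keep the line valid. */
static void cache_clean(toy_line_t *l)
{
    if (l->valid && l->dirty) {
        memcpy(l->ram, l->cache, LINE_SIZE);
        l->dirty = false;
    }
}

/* Invalidate: drop the cached copy without writing it back. */
static void cache_invalidate(toy_line_t *l)
{
    l->valid = false;
    l->dirty = false;
}
```

<p>In this model, a <code>cpu_write()</code> is invisible to <code>dma_read()</code> until <code>cache_clean()</code> runs, and a <code>dma_write()</code> is invisible to <code>cpu_read()</code> until <code>cache_invalidate()</code> runs, which mirrors the two failure directions described above.</p>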
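<h3 id="三类常见内存的差异">How Three Common Kinds of Memory Differ</h3>
<p>On the STM32H7, for example, developers regularly run into DTCM RAM, AXI SRAM, SRAM1 / SRAM2 / SRAM3, external SDRAM, and other regions. They differ not only in capacity but also in access path:</p>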
<ol>
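<li><strong>DTCM RAM</strong>: very fast for the CPU; well suited to the stack, real-time control variables, and intermediate algorithm state. Many DMA masters cannot reach DTCM, so a DMA buffer placed there simply fails: DMA does not transfer anything, or the peripheral delivers no data.</li>
<li><strong>AXI SRAM</strong>: sits on the AXI bus matrix and is reachable by the CPU and by many DMA masters, which makes it the usual choice for large DMA buffers. With the cache enabled the CPU processes it quickly, but coherency maintenance is mandatory.</li>
<li><strong>External SDRAM / PSRAM</strong>: large capacity, suitable for frame buffers, neural-network intermediate tensors, and file caches. Latency and arbitration are more complex, so it benefits more from the cache and more readily exposes refresh, alignment, and bandwidth bottlenecks.</li>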
</ol>
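<p>So the first engineering principle is: do not only ask "is this memory big enough?", also ask "who accesses it, over which bus, does the access go through the cache, and is DMA allowed to reach it?"</p>
<h2 id="二mpu-的价值把内存使用约定固化为硬件属性">2. The Value of the MPU: Turning "Memory Usage Conventions" into Hardware Attributes</h2>
<p>The MPU (Memory Protection Unit) is ignored in many bare-metal projects, on the assumption that it only matters when an RTOS needs task isolation and permission checks. On an MCU with a D-Cache, however, the most common use of the MPU is to define memory attributes: whether a region is cacheable, bufferable, or shareable, whether execution is allowed, and what the read/write permissions are.</p>
<p>If the MPU is left unconfigured, many libraries enable the cache with the default memory attributes. Those defaults may be fine for ordinary code and data, but they are not necessarily right for buffers shared with DMA. A mature high-performance MCU project usually divides memory into several classes:</p>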
<ul>
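<li>Ordinary code and regular data: cacheable, optimized for CPU performance;</li>
<li>DMA descriptors: non-cacheable or strictly maintained by hand, optimized for determinism;</li>
<li>Large DMA data buffers: cacheable, with Clean / Invalidate applied according to direction, to keep throughput;</li>
<li>Peripheral register regions: Device type, with no reordering and no caching;</li>
<li>Frame buffers: Write-through, Write-back, or Non-cacheable, chosen according to how drawing is split between LTDC / DMA2D and the CPU.</li>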
</ul>
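<p>This partitioning looks like extra work, but it turns "conventions" into hardware configuration applied at system startup, rather than comments scattered through driver code.</p>
<h3 id="一个典型的-mpu-区域规划">A Typical MPU Region Plan</h3>
<p>Below is a simplified plan. It does not correspond to the complete address map of any particular chip, but it is enough to show the idea:</p>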
<table>
<thead>
<tr>
<th>Region</th>
<th>Purpose</th>
<th>Suggested attributes</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
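<td>Flash</td>
<td>Code, read-only tables</td>
<td>Normal, Cacheable, Executable</td>
<td>Speeds up instruction fetch and table lookups</td>
</tr>
<tr>
<td>DTCM</td>
<td>Stack, control variables, real-time algorithms</td>
<td>Normal, Non-cacheable, or the TCM default</td>
<td>Low-latency CPU access; not used for DMA</td>
</tr>
<tr>
<td>AXI SRAM, general section</td>
<td>Heap, algorithm buffers</td>
<td>Normal, Write-back Cacheable</td>
<td>Improves CPU processing throughput</td>
</tr>
<tr>
<td>AXI SRAM, DMA descriptor section</td>
<td>ETH, SDMMC, USB descriptors</td>
<td>Normal, Non-cacheable</td>
<td>Keeps descriptor state from going out of sync</td>
</tr>
<tr>
<td>AXI SRAM, DMA data section</td>
<td>Network packets, image blocks, audio blocks</td>
<td>Normal, Cacheable + manual maintenance</td>
<td>Large data gains speed from the cache</td>
</tr>
<tr>
<td>Peripheral registers</td>
<td>GPIO, DMA, ETH, etc.</td>
<td>Device, Non-cacheable</td>
<td>Prevents caching and unwanted reordering</td>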
</tr>
</tbody>
</table>
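<p>Note that MPU regions have strict size and alignment rules and are limited in number. On the ARMv7-M MPU used by the Cortex-M7, a region's size must be a power of two of at least 32 bytes, its base address must be aligned to that size, and only 8 or 16 regions are available depending on the implementation. Do not create a separate region for every small array; instead, place shared buffers into common linker sections such as <code>.dma_desc</code> and <code>.dma_buffer</code>, and let the linker script guarantee alignment and boundaries.</p>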
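<p>The size and alignment rule is easy to check mechanically before a region ever reaches the hardware. The helper below is a minimal sketch under the ARMv7-M rule (power-of-two size of at least 32 bytes, base aligned to the size); the name <code>mpu_region_is_valid</code> is hypothetical and not part of any vendor HAL.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-M MPU rule: the size is a power of two, at least 32 bytes,
 * and the base address is a multiple of the size.
 * (A full 4 GB region cannot be expressed in a uint32_t size and is
 * not handled by this sketch.) */
static bool mpu_region_is_valid(uint32_t base, uint32_t size)
{
    bool pow2 = (size != 0U) && ((size & (size - 1U)) == 0U);
    return pow2 && (size >= 32U) && ((base & (size - 1U)) == 0U);
}
```

<p>For example, a 16 KB window at <code>0x24070000</code> passes the check, while a 512 KB region at the same base does not, because that base is not a multiple of 512 KB. A check like this fits naturally in a debug-build assertion next to the MPU setup code.</p>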
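<h2 id="三cache-维护动作cleaninvalidate-和-cleaninvalidate-的边界">3. Cache Maintenance Operations: Where Clean, Invalidate, and CleanInvalidate Apply</h2>
<p>The cache maintenance APIs have very similar names, and many bugs come from calling them at the wrong moment. A short rule of thumb covers it:</p>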
<ul>
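<li><strong>CPU writes, DMA reads</strong>: Clean before starting DMA, so that the dirty data the CPU wrote into the cache is written back to memory;</li>
<li><strong>DMA writes, CPU reads</strong>: Invalidate after DMA completes, so that the CPU drops the stale cache lines and fetches the new data from memory on the next read;</li>
<li><strong>Bidirectional or uncertain state</strong>: use CleanInvalidate with care, and first confirm that a stale dirty cache line cannot be written back over data the DMA has just written.</li>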
</ul>
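<p>The rule of thumb can be written down as a tiny lookup, which is sometimes handy as the single place a driver consults. This is only a sketch: the enum and function names are hypothetical, and the receive case assumes the buffer holds no dirty lines when DMA starts.</p>

```c
/* Direction of a DMA transfer as seen from memory. */
typedef enum { DMA_DIR_MEM_TO_PERIPH, DMA_DIR_PERIPH_TO_MEM } dma_dir_t;

typedef enum { CACHE_OP_NONE, CACHE_OP_CLEAN, CACHE_OP_INVALIDATE } cache_op_t;

/* Operation required before the transfer is started.
 * Assumption: an RX buffer has no dirty lines at this point. */
static cache_op_t cache_op_before_dma(dma_dir_t dir)
{
    return (dir == DMA_DIR_MEM_TO_PERIPH) ? CACHE_OP_CLEAN : CACHE_OP_NONE;
}

/* Operation required after the transfer has completed,
 * before the CPU touches the buffer again. */
static cache_op_t cache_op_after_dma(dma_dir_t dir)
{
    return (dir == DMA_DIR_PERIPH_TO_MEM) ? CACHE_OP_INVALIDATE : CACHE_OP_NONE;
}
```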
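<p>Take sending a network packet: the CPU builds the Ethernet frame, then the ETH DMA reads it from memory and transmits it. If the frame contents are still sitting in the D-Cache, the ETH DMA reads old memory. The correct action is to Clean the data buffer before handing the descriptor to DMA.</p>
<p>Take receiving a network packet: the ETH DMA writes data to memory, then an interrupt notifies the CPU. If the CPU has read this buffer before, the corresponding cache lines may still be in the D-Cache. The correct action is to Invalidate the buffer before the CPU parses the packet.</p>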
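<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#include &lt;stddef.h&gt;
</span></span></span><span class="line"><span class="cl"><span class="cp">#include &lt;stdint.h&gt;
</span></span></span><span class="line"><span class="cl"><span class="cp">#include &#34;stm32h7xx.h&#34; </span><span class="cm">/* assumed device header; pulls in core_cm7.h (SCB_* cache API, __DSB) */</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="cp">#define CACHE_LINE_SIZE 32U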
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kr">inline</span> <span class="kt">uintptr_t</span> <span class="nf">align_down</span><span class="p">(</span><span class="kt">uintptr_t</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">uintptr_t</span> <span class="n">align</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">addr</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="n">align</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kr">inline</span> <span class="kt">uintptr_t</span> <span class="nf">align_up</span><span class="p">(</span><span class="kt">uintptr_t</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">uintptr_t</span> <span class="n">align</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="p">(</span><span class="n">addr</span> <span class="o">+</span> <span class="n">align</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">)</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="n">align</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_clean_range</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">start</span> <span class="o">=</span> <span class="nf">align_down</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span><span class="p">,</span> <span class="n">CACHE_LINE_SIZE</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">end</span>   <span class="o">=</span> <span class="nf">align_up</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span> <span class="o">+</span> <span class="n">len</span><span class="p">,</span> <span class="n">CACHE_LINE_SIZE</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_CleanDCache_by_Addr</span><span class="p">((</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">start</span><span class="p">,</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DSB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_invalidate_range</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">start</span> <span class="o">=</span> <span class="nf">align_down</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span><span class="p">,</span> <span class="n">CACHE_LINE_SIZE</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">end</span>   <span class="o">=</span> <span class="nf">align_up</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span> <span class="o">+</span> <span class="n">len</span><span class="p">,</span> <span class="n">CACHE_LINE_SIZE</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_InvalidateDCache_by_Addr</span><span class="p">((</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">start</span><span class="p">,</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DSB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
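</span></span></code></pre></div><p>The key detail here is that the address and length are rounded out to cache line boundaries. Many HAL examples pass the raw address and length, which appears to work but is risky on unaligned buffers, because the hardware maintains whole cache lines, not C objects. If a 20-byte DMA buffer shares a 32-byte cache line with a neighboring variable, Invalidate can discard the neighbor's cached modifications, and Clean can write back data that should not have been committed yet. Rounding inside the helper does not remove this hazard; it only makes the affected range explicit. The real fix is for the buffer itself to start on a line boundary and occupy a whole number of lines.</p>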
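<p>The sharing hazard can be checked with plain address arithmetic. The sketch below assumes a 32-byte line; <code>CACHE_PAD</code> and <code>ranges_share_line</code> are hypothetical names for this illustration, and both lengths passed to the function must be non-zero.</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32U

/* Round a byte count up to a whole number of cache lines. */
#define CACHE_PAD(n) \
    ((((size_t)(n)) + CACHE_LINE - 1U) & ~((size_t)CACHE_LINE - 1U))

/* True if two non-empty byte ranges touch at least one common cache line. */
static bool ranges_share_line(uintptr_t a, size_t alen, uintptr_t b, size_t blen)
{
    uintptr_t a_first = a / CACHE_LINE;
    uintptr_t a_last  = (a + alen - 1U) / CACHE_LINE;
    uintptr_t b_first = b / CACHE_LINE;
    uintptr_t b_last  = (b + blen - 1U) / CACHE_LINE;
    return (a_first <= b_last) && (b_first <= a_last);
}
```

<p>With these helpers, a 20-byte buffer at a line-aligned address shares its line with a 4-byte variable placed immediately after it, whereas the same buffer declared with size <code>CACHE_PAD(20)</code> pushes that variable onto the next line.</p>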
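<h2 id="四链接脚本不要让-dma-buffer-和普通变量混住">4. Linker Script: Do Not Let DMA Buffers Share Space with Ordinary Variables</h2>
<p><code>__attribute__((aligned(32)))</code> alone solves part of the problem, but it is not systematic. Alignment fixes where the object starts; it does not stop the linker from placing other variables in the tail of the same cache line, nor does it guarantee that the whole object lands in SRAM that DMA can reach. A more robust approach is to create dedicated linker sections for DMA descriptors and DMA data.</p>
<p>With a GCC linker script, for example, you can carve two areas out of AXI SRAM: a non-cacheable descriptor area and a cacheable, manually maintained data area. The actual addresses must be adjusted to the chip's reference manual; only the structure is shown here.</p>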
<pre tabindex="0"><code class="language-ld" data-lang="ld">MEMORY
{
  FLASH   (rx)  : ORIGIN = 0x08000000, LENGTH = 2048K
  DTCMRAM (xrw) : ORIGIN = 0x20000000, LENGTH = 128K
  AXIRAM  (xrw) : ORIGIN = 0x24000000, LENGTH = 512K
}

SECTIONS
{
  .dma_desc (NOLOAD) : ALIGN(32)
  {
    __dma_desc_start__ = .;
    *(.dma_desc*)
    . = ALIGN(32);
    __dma_desc_end__ = .;
  } &gt; AXIRAM

  .dma_buffer (NOLOAD) : ALIGN(32)
  {
    __dma_buffer_start__ = .;
    *(.dma_buffer*)
    . = ALIGN(32);
    __dma_buffer_end__ = .;
  } &gt; AXIRAM
}
</code></pre><p>In C code, the objects can then be declared like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#define DMA_ALIGN __attribute__((aligned(32)))
</span></span></span><span class="line"><span class="cl"><span class="cp">#define DMA_DESC  __attribute__((section(&#34;.dma_desc&#34;), aligned(32)))
</span></span></span><span class="line"><span class="cl"><span class="cp">#define DMA_BUF   __attribute__((section(&#34;.dma_buffer&#34;), aligned(32)))
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="n">DMA_DESC</span> <span class="n">ETH_DMADescTypeDef</span> <span class="n">eth_rx_desc</span><span class="p">[</span><span class="n">ETH_RX_DESC_CNT</span><span class="p">];</span>
</span></span><span class="line"><span class="cl"><span class="n">DMA_DESC</span> <span class="n">ETH_DMADescTypeDef</span> <span class="n">eth_tx_desc</span><span class="p">[</span><span class="n">ETH_TX_DESC_CNT</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">DMA_BUF</span> <span class="kt">uint8_t</span> <span class="n">eth_rx_pool</span><span class="p">[</span><span class="n">ETH_RX_DESC_CNT</span><span class="p">][</span><span class="mi">1536</span><span class="p">];</span>
</span></span><span class="line"><span class="cl"><span class="n">DMA_BUF</span> <span class="kt">uint8_t</span> <span class="n">eth_tx_pool</span><span class="p">[</span><span class="n">ETH_TX_DESC_CNT</span><span class="p">][</span><span class="mi">1536</span><span class="p">];</span>
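</span></span></code></pre></div><p>This has several benefits. First, when reviewing the map file you can see at a glance where DMA resources live. Second, the MPU can mark the whole <code>.dma_desc</code> section Non-cacheable, provided the section is aligned to, and padded out to, a valid MPU region size. Third, data buffers at least no longer share a cache line with ordinary global variables. Fourth, porting to another chip or board becomes cheaper.</p>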
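<h2 id="五mpu-初始化顺序先配置属性再打开-cache">5. MPU Initialization Order: Configure Attributes First, Then Enable the Cache</h2>
<p>The order in which the MPU and the cache are initialized matters. The usual recommendation is to finish this early in system startup: disable the MPU, configure each region, enable the MPU, and only then enable the I-Cache / D-Cache. If the attributes of a memory range are changed after the system is already running, the affected cache lines must first be cleaned and invalidated with great care, otherwise the system is left in a state that is very hard to reproduce.</p>
<p>A simplified initialization flow looks like this:</p>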
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">system_memory_attr_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_Region_InitTypeDef</span> <span class="n">MPU_InitStruct</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_Disable</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* AXI SRAM: default write-back cacheable */</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Enable</span>           <span class="o">=</span> <span class="n">MPU_REGION_ENABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Number</span>           <span class="o">=</span> <span class="n">MPU_REGION_NUMBER0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">BaseAddress</span>      <span class="o">=</span> <span class="mh">0x24000000</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Size</span>             <span class="o">=</span> <span class="n">MPU_REGION_SIZE_512KB</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">AccessPermission</span> <span class="o">=</span> <span class="n">MPU_REGION_FULL_ACCESS</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsBufferable</span>     <span class="o">=</span> <span class="n">MPU_ACCESS_BUFFERABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsCacheable</span>      <span class="o">=</span> <span class="n">MPU_ACCESS_CACHEABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsShareable</span>      <span class="o">=</span> <span class="n">MPU_ACCESS_NOT_SHAREABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">TypeExtField</span>     <span class="o">=</span> <span class="n">MPU_TEX_LEVEL1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">DisableExec</span>      <span class="o">=</span> <span class="n">MPU_INSTRUCTION_ACCESS_DISABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">SubRegionDisable</span> <span class="o">=</span> <span class="mh">0x00</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_ConfigRegion</span><span class="p">(</span><span class="o">&amp;</span><span class="n">MPU_InitStruct</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* DMA descriptor window: non-cacheable */</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Number</span>           <span class="o">=</span> <span class="n">MPU_REGION_NUMBER1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">BaseAddress</span>      <span class="o">=</span> <span class="mh">0x24070000</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Size</span>             <span class="o">=</span> <span class="n">MPU_REGION_SIZE_16KB</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsBufferable</span>     <span class="o">=</span> <span class="n">MPU_ACCESS_NOT_BUFFERABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsCacheable</span>      <span class="o">=</span> <span class="n">MPU_ACCESS_NOT_CACHEABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsShareable</span>      <span class="o">=</span> <span class="n">MPU_ACCESS_SHAREABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_ConfigRegion</span><span class="p">(</span><span class="o">&amp;</span><span class="n">MPU_InitStruct</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_Enable</span><span class="p">(</span><span class="n">MPU_PRIVILEGED_DEFAULT</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_EnableICache</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_EnableDCache</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
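</span></span></code></pre></div><p>This code cannot be copied verbatim into every project, because HAL macros, addresses, region sizes, and default attributes differ between chips. In particular, the descriptor window at <code>0x24070000</code> only helps if the linker script actually places <code>.dma_desc</code> inside that 16 KB window, which the earlier linker fragment does not yet enforce; it relies on Region 1 overriding the overlapping Region 0, since on this MPU the higher-numbered region wins. What is worth copying are the principles: keep peripheral registers as Device memory; make ordinary memory cacheable where possible; prefer Non-cacheable for the descriptor area; if throughput matters for the large data area, make it cacheable and add maintenance functions; and make every region's base and size satisfy the MPU rules.</p>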
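<p>The <code>TypeExtField</code>, <code>IsCacheable</code>, and <code>IsBufferable</code> fields in the listing map onto the ARMv7-M TEX, C, and B bits. The sketch below decodes the common combinations so the two regions can be read at a glance; <code>mem_type_t</code> and <code>mpu_decode</code> are hypothetical names, and encodings not listed (reserved, implementation defined, or split inner/outer policies) are lumped into <code>MEM_OTHER</code>.</p>

```c
#include <stdint.h>

typedef enum {
    MEM_STRONGLY_ORDERED,
    MEM_DEVICE,
    MEM_NORMAL_WRITE_THROUGH,   /* write-through, no write allocate */
    MEM_NORMAL_WB_NO_WA,        /* write-back, no write allocate */
    MEM_NORMAL_NON_CACHEABLE,
    MEM_NORMAL_WB_WA,           /* write-back, write and read allocate */
    MEM_OTHER
} mem_type_t;

/* Decode the ARMv7-M MPU TEX/C/B bits into a memory type. */
static mem_type_t mpu_decode(uint8_t tex, uint8_t c, uint8_t b)
{
    if (tex == 0U) {
        if (!c) {
            return b ? MEM_DEVICE : MEM_STRONGLY_ORDERED;
        }
        return b ? MEM_NORMAL_WB_NO_WA : MEM_NORMAL_WRITE_THROUGH;
    }
    if (tex == 1U) {
        if (!c && !b) {
            return MEM_NORMAL_NON_CACHEABLE;
        }
        if (c && b) {
            return MEM_NORMAL_WB_WA;
        }
    }
    if (tex == 2U && !c && !b) {
        return MEM_DEVICE;      /* non-shareable Device */
    }
    return MEM_OTHER;
}
```

<p>Read this way, Region 0 (TEX level 1, cacheable, bufferable) is Normal write-back with write allocate, and Region 1 (TEX level 1, not cacheable, not bufferable) is Normal Non-cacheable, which matches the comments in the listing.</p>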
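<h2 id="六发送方向案例cpu-生产数据dma-消费数据">6. Transmit Case: the CPU Produces Data, DMA Consumes It</h2>
<p>The transmit direction is the easiest to explain. Suppose we send a block of samples with SPI DMA, or a network frame with ETH DMA. The data flow is: the CPU writes the buffer, DMA reads the buffer, the peripheral transmits.</p>
<p>The correct sequence is:</p>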
<ol>
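<li>The CPU fills the buffer;</li>
<li>If the buffer is in a cacheable region, perform a Clean;</li>
<li>Set the DMA source address, length, and direction;</li>
<li>Issue a memory barrier where necessary;</li>
<li>Start DMA;</li>
<li>After DMA completes, release the buffer or move on to the next round.</li>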
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">eth_send_frame</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">uint16_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">len</span> <span class="o">&gt;</span> <span class="mi">1536</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* The CPU has already written the Ethernet frame into buf. Write it back to memory before starting DMA. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_clean_range</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* Hypothetical interface: hand buf over to a DMA descriptor. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">eth_tx_desc_prepare</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* Order the descriptor update before the register write that starts DMA. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DMB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="nf">eth_tx_kick_dma</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
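</span></span></code></pre></div><p>One point is easy to overlook: if the descriptors themselves are in a cacheable region, you have to Clean the descriptors as well as the data buffer. Many Ethernet drivers make the descriptors Non-cacheable, so that DMA sees a descriptor change as soon as the CPU makes it and the driver logic stays simpler. The cost is slightly slower CPU access to descriptors, but descriptors are small and this overhead is usually acceptable.</p>
<p>The same reasoning applies to SDMMC card writes, QSPI DMA writes to flash, and SAI / I2S audio playback. Whenever the direction is "CPU writes first, DMA reads afterwards", the core action is Clean.</p>
<h2 id="七接收方向案例dma-生产数据cpu-消费数据">7. Receive Case: DMA Produces Data, the CPU Consumes It</h2>
<p>The receive direction is easier to get wrong, because the CPU may have touched the buffer, without intending to, before DMA finishes. Zeroing during network stack initialization, debug printing, and prefetching by the protocol stack all bring the corresponding cache lines into the D-Cache. If no Invalidate follows DMA completion, the CPU may still parse old content.</p>
<p>The correct sequence is:</p>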
<ol>
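<li>Prepare an empty buffer;</li>
<li>If the CPU has written to the buffer, Clean or CleanInvalidate it, as appropriate, before starting DMA;</li>
<li>Start DMA and let the peripheral write to memory;</li>
<li>Wait for the completion interrupt or poll the completion flag;</li>
<li>Invalidate before the CPU reads;</li>
<li>Parse the buffer.</li>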
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">camera_capture_one</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="o">*</span><span class="n">frame</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="cm">/* If the CPU cleared frame earlier and the region is cacheable, clean it to memory before starting. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_clean_range</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">dcmi_start_dma</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nf">wait_frame_done</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">dcmi_stop_dma</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* DMA has finished writing; discard stale cache lines before the CPU reads. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_invalidate_range</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
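</span></span></code></pre></div><p>Some references recommend "Invalidate once before receiving and once again after receiving". This is not without merit: invalidating before reception prevents a dirty cache line from being written back on a later eviction and overwriting the new DMA data, while invalidating after reception makes sure the CPU sees the DMA result. In practice, though, the better approach is to keep the CPU away from the buffer while DMA is running and to make ownership explicit with a buffer state machine. If ownership is muddled, extra maintenance calls only lower the probability of failure; they do not cure it.</p>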
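<p>The dirty-eviction hazard mentioned above is easy to reproduce in a host-side toy model. This is a sketch only: <code>toy_line_t</code> and its helpers are hypothetical, model a single 32-byte write-back line, and stand in for behavior that on real hardware depends on the cache replacement policy.</p>

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32U

typedef struct {
    uint8_t ram[LINE_SIZE];    /* memory as DMA sees it */
    uint8_t cache[LINE_SIZE];  /* cached copy as the CPU sees it */
    bool valid;
    bool dirty;
} toy_line_t;

/* CPU fills the buffer through the cache, e.g. a memset during init. */
static void cpu_fill(toy_line_t *l, uint8_t v)
{
    memset(l->cache, v, LINE_SIZE);
    l->valid = true;
    l->dirty = true;
}

/* DMA writes straight to RAM. */
static void dma_fill(toy_line_t *l, uint8_t v)
{
    memset(l->ram, v, LINE_SIZE);
}

/* The cache reuses the line for other data: dirty contents go to RAM. */
static void evict(toy_line_t *l)
{
    if (l->valid && l->dirty) {
        memcpy(l->ram, l->cache, LINE_SIZE);
    }
    l->valid = false;
    l->dirty = false;
}

/* Clean + invalidate before handing the buffer to DMA:
 * write back now, so nothing is left to write back later. */
static void clean_invalidate(toy_line_t *l)
{
    evict(l);
}
```

<p>If the CPU fills the line, DMA then writes new data, and the line is evicted afterwards, RAM ends up holding the CPU's old fill instead of the DMA data. Calling <code>clean_invalidate()</code> before the DMA write leaves no dirty line behind, so a later eviction cannot clobber what DMA wrote.</p>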
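<h2 id="八双缓冲与环形队列用所有权模型替代到处加维护函数">8. Double Buffering and Ring Queues: Replace "Maintenance Calls Everywhere" with an Ownership Model</h2>
<p>High-throughput peripherals rarely use a single buffer. Cameras have frame buffers, audio has ping-pong buffers, networking has RX / TX descriptor rings, and SD cards have block caches. At this point the most important thing is not to call a few more cache APIs but to establish a clear buffer ownership model.</p>
<p>The state of each buffer can be defined as:</p>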
<ul>
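<li><code>FREE</code>: the CPU may fill it or hand it to a peripheral;</li>
<li><code>DMA_OWNED</code>: DMA is reading or writing it; the CPU must not access it;</li>
<li><code>CPU_OWNED</code>: DMA has finished; the CPU may parse or modify it;</li>
<li><code>QUEUED</code>: already placed in the protocol stack or an application queue, waiting to be consumed.</li>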
</ul>
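<p>In the transmit direction, Clean is performed at the moment a buffer goes from <code>CPU_OWNED</code> to <code>DMA_OWNED</code>; in the receive direction, Invalidate is performed at the moment a buffer goes from <code>DMA_OWNED</code> to <code>CPU_OWNED</code>. Maintenance is tied to state transitions instead of being scattered through business code.</p>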
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">enum</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">BUF_FREE</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">BUF_CPU_OWNED</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">BUF_DMA_OWNED</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">BUF_QUEUED</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="kt">buf_state_t</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">addr</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">volatile</span> <span class="kt">buf_state_t</span> <span class="n">state</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="kt">dma_buf_t</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">rx_dma_complete_isr</span><span class="p">(</span><span class="kt">dma_buf_t</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">actual_len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">=</span> <span class="n">actual_len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_invalidate_range</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">addr</span><span class="p">,</span> <span class="n">actual_len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="n">b</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">=</span> <span class="n">BUF_CPU_OWNED</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">tx_submit</span><span class="p">(</span><span class="kt">dma_buf_t</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_clean_range</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">addr</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="n">b</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">=</span> <span class="n">BUF_DMA_OWNED</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_submit_to_hw</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">addr</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
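</span></span></code></pre></div><p>This pattern has a further benefit: if you later make some buffers Non-cacheable, you only need to change the implementation of <code>dma_clean_range()</code> / <code>dma_invalidate_range()</code> or the memory attributes, rather than hunting down and deleting maintenance calls throughout the business layer.</p>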
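<h2 id="九什么时候应该把-dma-区域设为-non-cacheable">9. When Should a DMA Region Be Non-cacheable?</h2>
<p>Since manual cache maintenance is so error-prone, why not just make every DMA buffer Non-cacheable? You can, but whether you should depends on the data size and access pattern.</p>
<p>Three kinds of objects usually suit Non-cacheable memory. The first is small objects such as descriptors, status words, and doorbell register mirrors: the CPU and DMA access them alternately and often, the data is small, and determinism matters more than cache hits. The second is small-packet buffers for low-speed peripherals, for example a UART DMA receiving commands of a few dozen bytes, where caching brings little benefit. The third is the debugging phase: to rule out coherency problems quickly, you can temporarily make the shared region Non-cacheable and see whether the cache is the cause.</p>
<p>Objects that do not suit Non-cacheable memory are just as typical: full camera frames, LCD framebuffers, neural-network input and output tensors, high-throughput network data pools, and file-system block caches. If the CPU scans, copies, color-converts, checksums, or parses this data heavily, turning the cache off entirely often drops performance to an unacceptable level. For example, an 800×480 RGB565 framebuffer is 768,000 bytes (750 KiB); if every CPU access while drawing the UI goes straight to external SDRAM, both refresh rate and responsiveness suffer badly.</p>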
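<p>The framebuffer arithmetic can be checked with a few lines of C. This is a minimal sketch: the helper names <code>frame_bytes()</code> and <code>frame_cache_lines()</code> are illustrative and not part of any library, and the 32-byte line size is the Cortex-M7 D-Cache line assumed throughout this article.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32U  /* assumed Cortex-M7 D-Cache line size */

/* Bytes needed for one frame; bytes_per_pixel is 2 for RGB565. */
static size_t frame_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    assert(bytes_per_pixel != 0U);
    return (size_t)width * height * bytes_per_pixel;
}

/* Number of cache lines a whole-buffer Clean or Invalidate has to walk. */
static size_t frame_cache_lines(size_t bytes)
{
    return (bytes + CACHE_LINE_BYTES - 1U) / CACHE_LINE_BYTES;
}
```

<p>For an 800×480 RGB565 frame this gives 768,000 bytes and 24,000 cache lines, which is a useful number to keep in mind when deciding how often a whole-frame maintenance call is affordable.</p>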
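<p>A more practical strategy is therefore: <strong>descriptors Non-cacheable, bulk data Cacheable; maintenance operations wrapped in one place; every buffer 32-byte aligned; the CPU and DMA never own the same buffer at the same time.</strong></p>
<h2 id="十性能调优一致性正确只是第一步">10. Performance Tuning: Correct Coherency Is Only the First Step</h2>
<p>Many projects fix their cache bugs only to find that performance is still disappointing. Cache coherency answers “is the data correct”; throughput also depends on bus arbitration, memory banks, burst length, and access pattern.</p>
<h3 id="1-减少无意义的整帧-clean--invalidate">1. Avoid Pointless Whole-Frame Clean / Invalidate</h3>
<p>Cache maintenance has a cost. Invalidating an entire framebuffer of several hundred KB takes noticeable time. If the DMA only wrote one ROI, maintain only that range; if a network packet is actually 300 bytes long, do not run maintenance over the whole 1536-byte pool buffer.</p>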
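<p>To make the cost difference concrete, the sketch below counts how many 32-byte cache lines a maintenance call has to touch for a given range. The function name <code>cache_lines_touched()</code> is hypothetical and only illustrates the arithmetic.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32U  /* assumed Cortex-M7 D-Cache line size */

/* Count the 32-byte cache lines overlapped by [addr, addr + len). */
static size_t cache_lines_touched(uintptr_t addr, size_t len)
{
    if (len == 0U) {
        return 0U;
    }
    uintptr_t first = addr / CACHE_LINE_BYTES;
    uintptr_t last  = (addr + len - 1U) / CACHE_LINE_BYTES;
    return (size_t)(last - first + 1U);
}
```

<p>A 300-byte packet starting on a line boundary touches 10 lines, while the full 1536-byte buffer touches 48, so maintaining only the actual length does roughly a fifth of the work.</p>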
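<h3 id="2-避免-cpu-在-dma-期间扫描同一区域">2. Keep the CPU from Scanning a Region While DMA Is Active</h3>
<p>Some code reads the first few bytes of a buffer during DMA reception, to show progress or for debugging. This complicates the cache state and can also cause bus contention. A better approach is to report progress through a separate status variable or DMA interrupt flag, and never peek at a data area the DMA currently owns.</p>
<h3 id="3-对齐不仅是-cache-line也是总线突发">3. Alignment Is About Bus Bursts, Not Just Cache Lines</h3>
<p>32-byte alignment solves the cache-line problem, but some DMA controllers and buses work better with 64-byte or 128-byte boundaries. For large contiguous data such as images and audio, try to make the row stride, block size, and ring-buffer node size match the hardware's recommended burst length, to reduce bank crossings and unaligned transfers.</p>
<h3 id="4-用-dwt-或-etm-做真实测量">4. Measure for Real with DWT or ETM</h3>
<p>Do not judge an optimization by feel alone. Cortex-M3, M4, and M7 class cores typically provide the DWT cycle counter (<code>DWT-&gt;CYCCNT</code>), which makes it easy to time Clean, Invalidate, memcpy, and protocol parsing. If the counter stays at zero on a Cortex-M7 part, the DWT may need to be unlocked first; check the device documentation.</p>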
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="nf">dwt_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">CoreDebug</span><span class="o">-&gt;</span><span class="n">DEMCR</span> <span class="o">|=</span> <span class="n">CoreDebug_DEMCR_TRCENA_Msk</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">DWT</span><span class="o">-&gt;</span><span class="n">CYCCNT</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">DWT</span><span class="o">-&gt;</span><span class="n">CTRL</span> <span class="o">|=</span> <span class="n">DWT_CTRL_CYCCNTENA_Msk</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">uint32_t</span> <span class="nf">measure_invalidate_cycles</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">start</span> <span class="o">=</span> <span class="n">DWT</span><span class="o">-&gt;</span><span class="n">CYCCNT</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">dma_invalidate_range</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">DWT</span><span class="o">-&gt;</span><span class="n">CYCCNT</span> <span class="o">-</span> <span class="n">start</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
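</span></span></code></pre></div><p>Printing the measurements as a table usually beats guessing. You will see that maintaining 64 bytes, 1500 bytes, 64 KB, and a full image frame cost very different amounts, and that some “just to be safe” global maintenance calls are in fact very expensive.</p>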
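<p>Two details help when turning raw <code>CYCCNT</code> samples into a table: the 32-bit counter wraps, and cycles mean little without the core clock. The helpers below are a sketch with hypothetical names; the core frequency passed in is whatever your clock tree is actually configured to.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two CYCCNT samples. Unsigned subtraction stays
 * correct across a single 32-bit wrap of the counter. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to whole microseconds for a core clock in Hz. */
static uint32_t cycles_to_us(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz != 0U);
    return (uint32_t)(((uint64_t)cycles * 1000000ULL) / core_hz);
}
```

<p>The measured interval must be shorter than one full counter period (about 9 seconds at 480 MHz) for the subtraction to be meaningful.</p>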
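<h2 id="十一常见故障排查清单">11. Troubleshooting Checklist for Common Failures</h2>
<p>When you hit a DMA + cache problem, working through the checks below in order is much more efficient.</p>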
<ol>
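<li><strong>Confirm the DMA can reach the memory region</strong>: many DMA controllers cannot access DTCM. Check the bus matrix and reference manual first; do not assume access just because the address is in SRAM.</li>
<li><strong>Confirm buffer address and length alignment</strong>: align the address to at least 32 bytes, round the length up to a multiple of 32, and preferably align the linker section boundaries too.</li>
<li><strong>Confirm the maintenance operation matches the direction</strong>: Clean when the CPU writes and DMA reads; Invalidate when DMA writes and the CPU reads; remember descriptors as well.</li>
<li><strong>Confirm the timing</strong>: Clean must happen before the DMA starts; Invalidate must happen after the DMA completes and before the CPU reads.</li>
<li><strong>Confirm there is no concurrent access</strong>: while the DMA owns a buffer, the CPU must not read or write that range.</li>
<li><strong>Confirm the MPU attributes are what you expect</strong>: read back the MPU configuration or print the region table in the boot log; do not simply trust CubeMX or the default startup file.</li>
<li><strong>Confirm barriers are still in place at your optimization level</strong>: add <code>__DMB()</code> / <code>__DSB()</code> at key state transitions, especially descriptor handoff and interrupt paths.</li>
<li><strong>Temporarily disable the D-Cache for an A/B test</strong>: if the problem disappears, you can largely narrow it down to coherency maintenance or MPU attributes.</li>
<li><strong>Check the map file</strong>: confirm the buffer was not linked into the wrong region, and that a typo has not left the section attribute ineffective.</li>
<li><strong>Check for hidden accesses in library code</strong>: protocol stacks, file systems, and graphics libraries may read or write the buffer early; extend the ownership rules to the library interface boundary.</li>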
</ol>
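<h2 id="十二一个更完整的工程封装思路">12. A More Complete Engineering Wrapper</h2>
<p>In a real project, business code should not call <code>SCB_CleanDCache_by_Addr()</code> directly. Instead, create a very thin <code>dma_mem</code> module that handles alignment, attributes, statistics, and debug assertions in one place.</p>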
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cm">/* dma_mem.h */</span>
</span></span><span class="line"><span class="cl"><span class="cp">#pragma once
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stddef.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdint.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="cp">#define DMA_CACHE_LINE 32U
</span></span></span><span class="line"><span class="cl"><span class="cp">#define DMA_SECTION __attribute__((section(&#34;.dma_buffer&#34;), aligned(DMA_CACHE_LINE)))
</span></span></span><span class="line"><span class="cl"><span class="cp">#define DMA_DESC_SECTION __attribute__((section(&#34;.dma_desc&#34;), aligned(DMA_CACHE_LINE)))
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_init</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_clean</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_invalidate</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_check_range</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">);</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cm">/* dma_mem.c */</span>
</span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;dma_mem.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;stm32h7xx.h&#34;</span><span class="cp">  </span><span class="cm">/* device header: defines the core configuration and pulls in core_cm7.h; use your vendor's equivalent */</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">uintptr_t</span> <span class="nf">down</span><span class="p">(</span><span class="kt">uintptr_t</span> <span class="n">v</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">v</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">DMA_CACHE_LINE</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">uintptr_t</span> <span class="nf">up</span><span class="p">(</span><span class="kt">uintptr_t</span> <span class="n">v</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="p">(</span><span class="n">v</span> <span class="o">+</span> <span class="n">DMA_CACHE_LINE</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">)</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">DMA_CACHE_LINE</span> <span class="o">-</span> <span class="mi">1U</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_clean</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">s</span> <span class="o">=</span> <span class="nf">down</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">e</span> <span class="o">=</span> <span class="nf">up</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span> <span class="o">+</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_CleanDCache_by_Addr</span><span class="p">((</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">,</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">e</span> <span class="o">-</span> <span class="n">s</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DSB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">dma_mem_invalidate</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">s</span> <span class="o">=</span> <span class="nf">down</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uintptr_t</span> <span class="n">e</span> <span class="o">=</span> <span class="nf">up</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">addr</span> <span class="o">+</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">SCB_InvalidateDCache_by_Addr</span><span class="p">((</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">,</span> <span class="p">(</span><span class="kt">int32_t</span><span class="p">)(</span><span class="n">e</span> <span class="o">-</span> <span class="n">s</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">    <span class="nf">__DSB</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
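</span></span></code></pre></div><p>In Debug builds, <code>dma_mem_check_range()</code> can check that the address falls inside a DMA-accessible memory section, that the length does not run past the buffer boundary, and that the address is 32-byte aligned. Problems then surface during development instead of turning into sporadic crashes at a customer site.</p>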
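<p>One possible shape for that check is sketched below. The window bounds <code>DMA_REGION_BASE</code> / <code>DMA_REGION_SIZE</code> are placeholders (real values would come from the linker script), the helper <code>dma_mem_range_ok()</code> is a hypothetical name, and requiring the length to be a whole number of cache lines is an extra assumption on top of the checks listed above.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_CACHE_LINE 32U

/* Placeholder DMA-capable window; take the real bounds from the linker script. */
#define DMA_REGION_BASE 0x30000000UL
#define DMA_REGION_SIZE 0x00040000UL

/* True when [addr, addr + len) is line-aligned and lies inside the window. */
static bool dma_mem_range_ok(uintptr_t addr, size_t len)
{
    if (len == 0U || (addr % DMA_CACHE_LINE) != 0U || (len % DMA_CACHE_LINE) != 0U) {
        return false;
    }
    if (addr < DMA_REGION_BASE) {
        return false;
    }
    uintptr_t offset = addr - DMA_REGION_BASE;
    if (offset >= DMA_REGION_SIZE) {
        return false;
    }
    /* Compare against the remaining space so addr + len cannot overflow. */
    return len <= DMA_REGION_SIZE - offset;
}

/* Debug-build wrapper with the void signature declared in dma_mem.h. */
void dma_mem_check_range(const void *addr, size_t len)
{
    assert(dma_mem_range_ok((uintptr_t)addr, len));
}
```

<p>Keeping the predicate separate from the asserting wrapper makes it possible to unit-test the rule on a host machine and to compile the assertion out in Release builds.</p>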
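<h2 id="十三把经验落到团队规范里">13. Turn the Lessons into Team Rules</h2>
<p>Cache coherency is not one driver engineer's personal habit; it belongs in the team's coding standard. At a minimum, write down the following conventions:</p>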
<ul>
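<li>All DMA buffers must be declared with the shared macro; ad-hoc <code>static uint8_t buf[1024]</code> declarations are not allowed;</li>
<li>All DMA descriptors must be placed in <code>.dma_desc</code>;</li>
<li>All DMA data pools must be placed in <code>.dma_buffer</code> or explicitly marked Non-cacheable;</li>
<li>Drivers must perform maintenance through the <code>dma_mem_*</code> interface before and after submitting a buffer;</li>
<li>While the DMA owns a buffer the CPU must not access it, debug prints included;</li>
<li>Every new peripheral driver must state its memory region, direction, maintenance operations, and alignment strategy in review;</li>
<li>Every change to the linker script, MPU initialization, or external RAM configuration must be followed by a run of the DMA regression tests.</li>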
</ul>
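<p>The regression test can be very plain: build buffers with different lengths, offsets, and fill values; have the DMA do a memory-to-memory transfer or a peripheral loopback; have the CPU compute a CRC before and after; and cover boundary lengths such as 1, 31, 32, 33, 1500, and 4096 bytes. Many hidden problems surface at lengths like 31 / 33 bytes that are not a whole number of cache lines.</p>
<h2 id="十四总结高性能-mcu-要用-soc-思维来写">14. Summary: Write High-Performance MCU Code with an SoC Mindset</h2>
<p>High-performance MCUs such as the Cortex-M7 still offer a microcontroller development experience, but their memory system is close to that of a small SoC: cache, MPU, multiple buses, multiple SRAM domains, external RAM, and several DMA masters all coexist. If you keep writing code with the old “all addresses are equivalent” model, the closer the project gets to mass production, the more likely you are to hit sporadic problems that are hard to reproduce and hard to locate.</p>
<p>The keys to a stable system are not complicated: first map out the access paths between CPU, DMA, and memory; fix memory attributes with the MPU; isolate descriptors and data pools with the linker script; perform Clean / Invalidate correctly for each direction; bind maintenance to buffer ownership transitions; and finally verify performance and correctness with DWT, CRC, and boundary-length tests.</p>
<p>In practice the most reliable solution is usually neither “disable the cache globally” nor “sprinkle an Invalidate wherever something breaks”, but managing memory as an engineering discipline. Descriptors are small and critical, so they can be Non-cacheable; bulk data needs throughput, so keep it Cacheable, but with strict alignment, ownership, and wrapped maintenance. This preserves Cortex-M7 performance while keeping high-bandwidth peripherals such as ETH, SDMMC, DCMI, LTDC, and DMA2D stable.</p>
<p>If you are building an STM32H7 camera gateway, industrial Ethernet node, UI controller with a display, audio capture device, or on-device AI inference board, put these rules into your project template early. The sooner you establish discipline around memory regions and DMA buffers, the less time you will spend chasing “occasional dirty data” later, and the more maintainable the system will be.</p>
<h2 id="十五附录一套最小回归用例">15. Appendix: A Minimal Regression Suite</h2>
<p>Finally, here is a simple but effective regression approach. Reserve a 4 KB DMA test area and start tests at offsets of 0, 1, 15, 31, 32, and 33 bytes, covering lengths of 1, 16, 31, 32, 33, 127, 256, and 1500 bytes. Before each test the CPU writes an incrementing pattern, then Cleans and lets the DMA copy it to another area; after the DMA completes, Invalidate the destination and have the CPU compute a CRC. Then reverse the direction: let the DMA write into the source area and have the CPU verify after Invalidate. The test needs no complex peripheral; many chips can run it with memory-to-memory DMA, which makes it a good first check during board bring-up.</p>
<p>If this boundary suite runs continuously for hours with the D-Cache enabled, the highest optimization level, RTOS task switching, and interrupt load all present, then the memory attributes, linker script, alignment wrappers, and maintenance timing are basically trustworthy. Conversely, if any 31-byte, 33-byte, or nonzero-offset case fails, do not rush to blame the protocol stack; go back to cache lines, ownership, and MPU regions first.</p>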
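<p>The loop structure of such a suite might look like the sketch below. It is host-runnable on purpose: <code>fake_dma_copy()</code> is a stand-in that uses <code>memcpy</code>, and on target it would be replaced by the real Clean, DMA transfer, and Invalidate sequence. The function names and the guard-byte check around the destination are additions for illustration, not something the procedure above prescribes.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TEST_AREA_BYTES 4096U

static uint8_t src_area[TEST_AREA_BYTES];
static uint8_t dst_area[TEST_AREA_BYTES];

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); small rather than fast. */
static uint32_t crc32_calc(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n-- > 0U) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Stand-in for Clean + DMA transfer + Invalidate; swap in the real sequence on target. */
static void fake_dma_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Runs every offset/length combination and returns the number of failed checks. */
static unsigned run_boundary_cases(void)
{
    static const size_t offsets[] = {0, 1, 15, 31, 32, 33};
    static const size_t lengths[] = {1, 16, 31, 32, 33, 127, 256, 1500};
    unsigned failures = 0;

    for (size_t i = 0; i < sizeof offsets / sizeof offsets[0]; i++) {
        for (size_t j = 0; j < sizeof lengths / sizeof lengths[0]; j++) {
            size_t off = offsets[i];
            size_t len = lengths[j];
            assert(off + len < TEST_AREA_BYTES);

            memset(dst_area, 0xA5, sizeof dst_area);
            for (size_t k = 0; k < len; k++) {
                src_area[off + k] = (uint8_t)(off + k);  /* incrementing pattern */
            }
            uint32_t expect = crc32_calc(&src_area[off], len);

            fake_dma_copy(&dst_area[off], &src_area[off], len);

            if (crc32_calc(&dst_area[off], len) != expect) {
                failures++;
            }
            /* Bytes just outside the destination range must keep the fill value. */
            if ((off > 0U && dst_area[off - 1U] != 0xA5) || dst_area[off + len] != 0xA5) {
                failures++;
            }
        }
    }
    return failures;
}
```

<p>On target, the guard-byte check is what catches an Invalidate that was rounded out to a full cache line and discarded a neighbor's dirty data, which is exactly the failure the 31 / 33-byte cases are meant to provoke.</p>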
]]></content:encoded>
    </item>
    <item>
      <title>STM32H7 Dual-Core Communication in Practice: Connecting Cortex-M7 and Cortex-M4 with OpenAMP and RPMsg</title>
      <link>https://tech-snippets.xyz/posts/stm32h7-openamp-rpmsg-dual-core-guide/</link>
      <pubDate>Fri, 05 Jun 2026 03:00:00 +0800</pubDate>
      <guid>https://tech-snippets.xyz/posts/stm32h7-openamp-rpmsg-dual-core-guide/</guid>
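      <description>From memory partitioning, boot order, the resource table, and the RPMsg protocol to cache coherency: a complete walk-through of communication design and debugging for STM32H7 dual-core projects.</description>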
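      <content:encoded><![CDATA[<h2 id="引言双核-mcu-的难点不在多一个核而在边界设计">Introduction: The Hard Part of a Dual-Core MCU Is Not “One More Core” but Boundary Design</h2>
<p>Dual-core MCUs such as the STM32H745, STM32H747, STM32H755, and STM32H757 look very attractive: a Cortex-M7 running at several hundred MHz with I-Cache, D-Cache, FPU, and rich high-speed peripherals, plus a Cortex-M4 better suited to interrupts, sampling, control loops, and low-jitter tasks. In theory, put the UI, networking, file system, and machine-vision preprocessing on the M7 and motor control, ADC sampling, CAN communication, and protection logic on the M4, and you get both throughput and real-time behavior.</p>
<p>In a real project, though, the question is usually not “can both cores run at once” but: who boots whom? Where does shared memory live? How does the message format evolve? Why does the M4 stop seeing new data once the M7 enables its D-Cache? How does the M7 degrade gracefully when the M4 hangs? After mass production, how do you find out at which stage a cross-core message was lost? If these questions are not settled at the architecture stage, they later turn into random failures that are very hard to diagnose.</p>
<p>Using the STM32H7 dual-core family as background, this article describes a reasonably robust OpenAMP / RPMsg communication scheme. OpenAMP is most often seen in heterogeneous Linux + MCU systems, but on the STM32H7 it can also serve as the message layer between Cortex-M7 and Cortex-M4. Its value is not in making the code look “advanced”, but in folding shared memory, vrings, endpoints, the resource table, and notification interrupts into one maintainable model.</p>
<p><img alt="STM32H7 OpenAMP RPMsg architecture" loading="lazy" src="/images/stm32h7-openamp-rpmsg-architecture.svg"></p>
<p>This article does not stop at concepts. It starts with the chip's boot model and works through memory layout, CubeMX configuration, the resource table, RPMsg endpoint design, cache coherency, protocol framing, debugging techniques, and common failures. The code leans toward a project skeleton: the goal is to show where each module belongs and which parts must be adapted to your board.</p>
<h2 id="一先给两个核心分工m7-管复杂m4-管确定">1. Divide the Work First: M7 Handles Complexity, M4 Handles Determinism</h2>
<p>The easiest mistake with a dual-core MCU is treating it as “two microcontrollers soldered together”. If both cores drive the same peripherals directly, both can modify the same shared variables, and both can decide system state, the communication layer will sooner or later become a tangle. A more controllable approach is to define the boundary first: the M7 owns complex business logic, the M4 owns deterministic tasks.</p>
<p>For a device with a touchscreen and a motor, for example, the split could be:</p>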
<ul>
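<li>M7: GUI, parameter management, Ethernet or Wi-Fi gateway, logging, file system, OTA, host-protocol parsing;</li>
<li>M4: PWM output, encoder sampling, ADC sampling, overcurrent protection, real-time state machine, periodic CAN or RS485 frames;</li>
<li>Inter-core communication: the M7 sends down parameters and control commands; the M4 reports status, fault codes, and sampling summaries.</li>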
</ul>
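<p>The benefit of this split is clear business semantics. The M7 can take on tasks that are complex but less deterministic; if it occasionally blocks for tens of milliseconds on the file system or a network protocol, the M4's control loop is not directly affected. The M4, in turn, avoids string parsing, large memory allocations, and complex protocol stacks as far as possible and only guarantees that real-time tasks run reliably.</p>
<p>If both cores must know the state of some peripheral, prefer a “single owner + message mirror” design. For instance, the ADC DMA buffer is owned by the M4; the M7 does not read the raw buffer the DMA is writing, but has the M4 periodically produce a summary or copy a snapshot and send it over RPMsg. This looks like one extra copy, but it buys clear boundaries and easier debugging.</p>
<h2 id="二stm32h7-双核启动模型谁先活谁释放谁">2. The STM32H7 Dual-Core Boot Model: Who Wakes First, Who Releases Whom</h2>
<p>On STM32H7 dual-core devices the M7 usually boots first as the master core, while the M4 can be held until the M7 has initialized the system clock, memory, peripherals, and shared region, and then released. Different projects can choose different boot strategies, but for OpenAMP communication the most recommended one is: the M7 does system-level initialization and then starts the M4; once running, the M4 initializes its own OpenAMP endpoint and enters its message loop.</p>
<p>A typical boot flow is:</p>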
<ol>
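<li>After reset, the M7 boots from Flash;</li>
<li>The M7 configures the system clock, power domains, MPU, cache, and the required shared memory regions;</li>
<li>The M7 initializes the OpenAMP framework and prepares the shared region needed by the vrings and resource table;</li>
<li>The M7 releases the M4 through <code>HAL_RCCEx_EnableBootCore</code> or an equivalent mechanism;</li>
<li>The M4 boots from its own vector table address and initializes the HAL, peripherals, and OpenAMP;</li>
<li>Both sides create RPMsg endpoints and exchange handshake messages;</li>
<li>Only after the handshake completes may the application layer send control commands.</li>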
</ol>
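<p>In a real project, do not send business messages immediately after power-up. The two cores do not take exactly the same time to start, and on some boards the external crystal, peripheral resets, and power sequencing add further variation. It is best to define a <code>HELLO / READY / VERSION</code> handshake; until the link state reaches <code>IPC_READY</code>, the business layer either queues only critical commands or returns “remote core not ready” directly.</p>
<p>A simplified state machine could look like this:</p>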
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">enum</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">IPC_DOWN</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">IPC_BOOTING</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">IPC_ENDPOINT_CREATED</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">IPC_WAIT_REMOTE_READY</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">IPC_READY</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">IPC_FAULT</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="kt">ipc_state_t</span><span class="p">;</span>
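</span></span></code></pre></div><p>On the M7 side, do not judge link availability by “the function returned success” alone; use an explicit confirmation message from the remote side. A successful OpenAMP initialization only means the local data structures are ready; whether messages can actually flow still depends on whether the M4 has created its endpoint, can answer a version query, and can service the ring buffer.</p>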
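<p>One way to drive those states is a small pure transition function, sketched below. The event names in <code>ipc_event_t</code> and the helpers <code>ipc_next_state()</code> / <code>ipc_can_send()</code> are hypothetical and are not part of the OpenAMP API; the enum is repeated only so the sketch is self-contained.</p>

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    IPC_DOWN = 0,
    IPC_BOOTING,
    IPC_ENDPOINT_CREATED,
    IPC_WAIT_REMOTE_READY,
    IPC_READY,
    IPC_FAULT
} ipc_state_t;

/* Hypothetical link events; illustrative names only. */
typedef enum {
    IPC_EVT_REMOTE_RELEASED,    /* M7 has released the M4 from hold */
    IPC_EVT_LOCAL_EPT_CREATED,  /* local endpoint creation succeeded */
    IPC_EVT_HELLO_SENT,         /* handshake request queued */
    IPC_EVT_REMOTE_READY,       /* READY/VERSION reply received from the M4 */
    IPC_EVT_TIMEOUT             /* any handshake or heartbeat timeout */
} ipc_event_t;

/* Returns the next state; unexpected events leave the state unchanged. */
static ipc_state_t ipc_next_state(ipc_state_t s, ipc_event_t e)
{
    if (e == IPC_EVT_TIMEOUT) {
        return IPC_FAULT;
    }
    switch (s) {
    case IPC_DOWN:              return (e == IPC_EVT_REMOTE_RELEASED)   ? IPC_BOOTING           : s;
    case IPC_BOOTING:           return (e == IPC_EVT_LOCAL_EPT_CREATED) ? IPC_ENDPOINT_CREATED  : s;
    case IPC_ENDPOINT_CREATED:  return (e == IPC_EVT_HELLO_SENT)        ? IPC_WAIT_REMOTE_READY : s;
    case IPC_WAIT_REMOTE_READY: return (e == IPC_EVT_REMOTE_READY)      ? IPC_READY             : s;
    default:                    return s;
    }
}

/* Business messages are allowed only once the remote side has confirmed. */
static bool ipc_can_send(ipc_state_t s)
{
    return s == IPC_READY;
}
```

<p>Because the link becomes usable only on <code>IPC_EVT_REMOTE_READY</code>, a successful local initialization can never be mistaken for a working link.</p>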
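<h2 id="三共享内存布局最怕差不多能用">3. Shared Memory Layout: Beware of “Close Enough”</h2>
<p>Inter-core communication always rests on shared memory. The structures commonly used by OpenAMP / RPMsg include the resource table, two vrings, the message buffers, and possibly a private shared data area. The most important principles here are: fixed addresses, identical on both cores, explicit MPU attributes, and no competing with the stack or heap for space.</p>
<p>A production-style linker script reserves regions roughly like this:</p>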
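<pre tabindex="0"><code class="language-ld" data-lang="ld">/* Example addresses only: adjust for your STM32H7 part and SRAM partitioning, and check in the reference manual that SRAM actually exists across each range */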
RAM_D2 (xrw)      : ORIGIN = 0x30000000, LENGTH = 256K
SHM_IPC (xrw)     : ORIGIN = 0x30040000, LENGTH = 64K
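</code></pre><p>Both the M7 and M4 projects must agree on the address of <code>SHM_IPC</code>. If the M7 believes the vring is at <code>0x30040000</code> while the M4, because of a different linker script or macro definition, places it at <code>0x30020000</code>, the symptom is usually not an immediate error but sporadic HardFaults, corrupted messages, or initialization timeouts.</p>
<p>Put the shared memory configuration in a single header and have both the M7 and M4 projects include that same file:</p>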
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#define IPC_SHM_BASE        0x30040000UL
</span></span></span><span class="line"><span class="cl"><span class="cp">#define IPC_SHM_SIZE        0x00010000UL
</span></span></span><span class="line"><span class="cl"><span class="cp">#define IPC_VRING_SIZE      0x00001000UL
</span></span></span><span class="line"><span class="cl"><span class="cp">#define IPC_BUFFER_COUNT    16U
</span></span></span><span class="line"><span class="cl"><span class="cp">#define IPC_BUFFER_SIZE     512U
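</span></span></span></code></pre></div><p>Also mind the STM32H7 memory domains. The D1, D2, and D3 SRAMs differ in access permissions, bus connections, and performance. The M7 accesses some regions through its cache, while the M4's accesses to the same region do not go through the M7's D-Cache. Without a correct MPU configuration or manual cache clean / invalidate, what the M7 wrote may still sit in its own cache line while the M4 reads the old value. This is the most typical, and most easily misdiagnosed, software bug in dual-core communication.</p>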
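<p>With the sizes in one header, the sub-layout can also be derived and checked at compile time. The carve-up below (vring0, then vring1, then the buffer pool) is an assumed arrangement for illustration; the derived macro names and <code>ipc_buffer_addr()</code> are hypothetical, and the resource table is deliberately left out because its placement depends on the port.</p>

```c
#include <assert.h>
#include <stdint.h>

#define IPC_SHM_BASE        0x30040000UL
#define IPC_SHM_SIZE        0x00010000UL
#define IPC_VRING_SIZE      0x00001000UL
#define IPC_BUFFER_COUNT    16U
#define IPC_BUFFER_SIZE     512U

/* Assumed carve-up of the shared region: vring0, vring1, then the buffer pool. */
#define IPC_VRING0_ADDR     (IPC_SHM_BASE)
#define IPC_VRING1_ADDR     (IPC_VRING0_ADDR + IPC_VRING_SIZE)
#define IPC_BUFFERS_ADDR    (IPC_VRING1_ADDR + IPC_VRING_SIZE)
#define IPC_BUFFERS_BYTES   ((unsigned long)IPC_BUFFER_COUNT * IPC_BUFFER_SIZE)
#define IPC_LAYOUT_END      (IPC_BUFFERS_ADDR + IPC_BUFFERS_BYTES)

/* C11 compile-time checks: the build fails if the layout stops fitting. */
_Static_assert(IPC_LAYOUT_END <= IPC_SHM_BASE + IPC_SHM_SIZE,
               "IPC layout overflows the shared region");
_Static_assert((IPC_BUFFERS_ADDR % 32U) == 0U,
               "buffer pool must start on a cache-line boundary");
_Static_assert((IPC_BUFFER_SIZE % 32U) == 0U,
               "each buffer must be a whole number of cache lines");

/* Address of buffer i, for code that indexes the pool directly. */
static inline uintptr_t ipc_buffer_addr(unsigned i)
{
    assert(i < IPC_BUFFER_COUNT);
    return (uintptr_t)(IPC_BUFFERS_ADDR + (unsigned long)i * IPC_BUFFER_SIZE);
}
```

<p>With these numbers the two vrings and sixteen 512-byte buffers use 16 KB of the 64 KB region, leaving room for the resource table and any private shared data.</p>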
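<h2 id="四cubemx-配置要点不要只勾-openamp">4. CubeMX Configuration: Ticking OpenAMP Is Not Enough</h2>
<p>Many people configuring OpenAMP in STM32CubeMX for the first time assume that enabling the middleware is all there is to it. In fact CubeMX only generates part of the framework code; whether the project runs reliably also depends on memory, boot, HSEM, interrupt, and cache configuration.</p>
<p>First, confirm the project is a dual-core project that generates separate CM7 and CM4 targets. The M7 project generally contains the system clock and global initialization; the M4 project initializes only the peripherals it needs. Settle peripheral ownership at design time and do not let both cores initialize the same peripheral instance. If TIM1 generates PWM on the M4, for example, do not also generate TIM1 initialization code in the M7 project.</p>
<p>Second, HSEM. The STM32H7 provides hardware semaphores that can be used for inter-core synchronization and event notification. OpenAMP ports usually rely on HSEM or an IPCC-like mechanism to tell the other side to process messages. Even if the application never calls HSEM directly, you need to know it is there: if the interrupt priority is misconfigured, the HSEM clock is not enabled, or the interrupt handler does not correctly trigger OpenAMP polling, RPMsg can appear as “send succeeded but the remote never handles it”.</p>
<p>Third, memory regions. For shared memory, either use the MPU to make it non-cacheable, or strictly perform cache clean / invalidate around every access. To reduce the chance of mistakes, many control-oriented projects configure the IPC shared region as non-cacheable. Some performance is lost, but message packets are usually small and the determinism gained is worth more.</p>
<p>The sketch below shows the idea for the M7-side MPU configuration; the exact API names and parameters depend on your HAL version:</p>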
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">void</span> <span class="nf">MPU_Config_IPC</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_Region_InitTypeDef</span> <span class="n">MPU_InitStruct</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_Disable</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Enable</span> <span class="o">=</span> <span class="n">MPU_REGION_ENABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Number</span> <span class="o">=</span> <span class="n">MPU_REGION_NUMBER2</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">BaseAddress</span> <span class="o">=</span> <span class="n">IPC_SHM_BASE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">Size</span> <span class="o">=</span> <span class="n">MPU_REGION_SIZE_64KB</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">SubRegionDisable</span> <span class="o">=</span> <span class="mh">0x00</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">TypeExtField</span> <span class="o">=</span> <span class="n">MPU_TEX_LEVEL1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">AccessPermission</span> <span class="o">=</span> <span class="n">MPU_REGION_FULL_ACCESS</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">DisableExec</span> <span class="o">=</span> <span class="n">MPU_INSTRUCTION_ACCESS_DISABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsShareable</span> <span class="o">=</span> <span class="n">MPU_ACCESS_SHAREABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsCacheable</span> <span class="o">=</span> <span class="n">MPU_ACCESS_NOT_CACHEABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="n">MPU_InitStruct</span><span class="p">.</span><span class="n">IsBufferable</span> <span class="o">=</span> <span class="n">MPU_ACCESS_NOT_BUFFERABLE</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_ConfigRegion</span><span class="p">(</span><span class="o">&amp;</span><span class="n">MPU_InitStruct</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">HAL_MPU_Enable</span><span class="p">(</span><span class="n">MPU_PRIVILEGED_DEFAULT</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
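</span></span></code></pre></div><p>If you choose to keep the cache enabled, you must align RPMsg buffers to the cache line, call <code>SCB_CleanDCache_by_Addr</code> before sending, and call <code>SCB_InvalidateDCache_by_Addr</code> before and after receiving. This route is friendlier to performance but demands more team discipline. If a new colleague adds a structure to the shared region and forgets to align it, a hard-to-reproduce problem can creep in.</p>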
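<p>One way to make the alignment rule hard to forget is to bake it into the type and let the compiler check it. The structure <code>ipc_shared_status_t</code> and its fields are invented for illustration; the technique is C11 <code>alignas</code> plus <code>_Static_assert</code>.</p>

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32U  /* assumed Cortex-M7 D-Cache line size */

/* Hypothetical shared record. alignas on the first member raises the
 * alignment of the whole struct, so sizeof is padded to a multiple of 32
 * and no unrelated variable can share its cache line. */
typedef struct {
    alignas(CACHE_LINE_BYTES) uint32_t seq;
    uint32_t fault_code;
    uint16_t adc_summary[4];
} ipc_shared_status_t;

_Static_assert(alignof(ipc_shared_status_t) == CACHE_LINE_BYTES,
               "shared structs must be cache-line aligned");
_Static_assert(sizeof(ipc_shared_status_t) % CACHE_LINE_BYTES == 0U,
               "shared structs must occupy whole cache lines");

static ipc_shared_status_t g_ipc_status;  /* example instance */
```

<p>If someone later adds a shared type without the <code>alignas</code>, a matching pair of assertions next to it turns the mistake into a build error instead of a field failure.</p>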
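<h2 id="五resource-table让两边说清楚共享资源在哪里">5. The Resource Table: Making Both Sides Agree on Where Shared Resources Live</h2>
<p>The OpenAMP resource table can be understood as the table in which the remote processor describes the shared resources it needs. It usually contains vring addresses, sizes, alignment, device features, and similar information. In a dual-MCU-core setting like the STM32H7, the resource table need not be as complete and complex as under Linux remoteproc, but it is still an important basis for both sides' understanding of the RPMsg device.</p>
<p>You can view the resource table as a low-level contract:</p>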
<ul>
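<li>where vring0 and vring1 are located;</li>
<li>how many descriptors each ring has;</li>
<li>where the buffer area starts;</li>
<li>what the alignment boundary is;</li>
<li>how the notification mechanism and feature bits are set.</li>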
</ul>
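<p>Do not scatter these addresses across multiple files. Review the address definitions, the resource table, and the linker script together. Late in a project especially, it is easy for a developer to shuffle SRAM around to “temporarily add a DMA buffer” and forget to update the OpenAMP addresses, after which inter-core communication fails for no apparent reason.</p>
<p>Compile-time assertions reduce this kind of error:</p>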
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#define STATIC_ASSERT(COND, MSG) typedef char static_assertion_##MSG[(COND)?1:-1]
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="nf">STATIC_ASSERT</span><span class="p">((</span><span class="n">IPC_SHM_BASE</span> <span class="o">%</span> <span class="mh">0x1000</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="n">ipc_base_must_4k_aligned</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="nf">STATIC_ASSERT</span><span class="p">(</span><span class="n">IPC_SHM_SIZE</span> <span class="o">&gt;=</span> <span class="mh">0x10000</span><span class="p">,</span> <span class="n">ipc_shm_too_small</span><span class="p">);</span>
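</span></span></code></pre></div><p>In addition, the resource table and shared buffers should not live in an ordinary <code>.bss</code> section that the C runtime zero-initializes. The two cores start in a different order; if one side has just initialized a ring and the other then runs its zeroing code, the ring state is destroyed. The project should state explicitly which sections are initialized by whom, which are noinit, and which must be rebuilt after reset.</p>
<h2 id="六rpmsg-endpoint不要把消息通道设计成全局变量垃圾桶">6. RPMsg Endpoints: Do Not Turn the Message Channel into a Global-Variable Dumping Ground</h2>
<p>An RPMsg endpoint provides something like a “named port” abstraction. An endpoint can have a name, an address, and a callback. Many example projects create a single endpoint and push every message through it, telling types apart by the first byte. That is fine in a demo but becomes a maintenance disaster as the project grows.</p>
<p>It is better to split endpoints by business domain, or at least to split the protocol modules. For example:</p>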
<ul>
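<li><code>rpmsg-ctrl</code>: boot handshake, version query, heartbeat, reset requests;</li>
<li><code>rpmsg-param</code>: parameter read/write, parameter validation, persistence notifications;</li>
<li><code>rpmsg-telemetry</code>: status reports, sampling summaries, fault codes;</li>
<li><code>rpmsg-debug</code>: diagnostic commands, log retrieval, performance statistics.</li>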
</ul>
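<p>If the low-level port is limited or resources are tight, a single endpoint is acceptable, but the application protocol must be modular; individual business files must not call <code>OPENAMP_send</code> directly. Wrap it in an <code>ipc_send()</code> that handles timeouts, sequence numbers, statistics, and error codes in one place.</p>
<p>A simple message header could be defined like this:</p>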
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#define IPC_MAGIC 0x48495043u  </span><span class="cm">/* &#34;HIPC&#34; */</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#define IPC_VERSION 1u
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">struct</span> <span class="nf">__attribute__</span><span class="p">((</span><span class="n">packed</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">magic</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint16_t</span> <span class="n">version</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint16_t</span> <span class="n">type</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">seq</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">timestamp_ms</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint16_t</span> <span class="n">payload_len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint16_t</span> <span class="n">flags</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="kt">ipc_msg_header_t</span><span class="p">;</span>
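</span></span></code></pre></div><p>Carrying <code>magic</code> and <code>version</code> in every message keeps the M7 and M4 from misinterpreting each other when their firmware versions differ. <code>seq</code> is used for request/response matching and loss statistics, and <code>timestamp_ms</code> for latency analysis. <code>flags</code> can mark whether an ACK is required, whether the message is a fragment, and whether it is an error response.</p>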
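<p>As a minimal sketch of how a receiver might use these fields, the helper below rejects a buffer unless <code>magic</code>, <code>version</code>, and <code>payload_len</code> are all consistent with the number of bytes actually received. The name <code>ipc_header_valid()</code> is hypothetical and not part of OpenAMP; the header layout is repeated only so the block stands alone.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define IPC_MAGIC   0x48495043u  /* &#34;HIPC&#34; */
#define IPC_VERSION 1u

typedef struct __attribute__((packed)) {
    uint32_t magic;
    uint16_t version;
    uint16_t type;
    uint32_t seq;
    uint32_t timestamp_ms;
    uint16_t payload_len;
    uint16_t flags;
} ipc_msg_header_t;

/* Hypothetical helper: copies the header out of a received buffer and
 * returns 1 only if magic, version and payload_len are all plausible. */
static int ipc_header_valid(const void *buf, size_t len, ipc_msg_header_t *out)
{
    if (buf == NULL || out == NULL || len &lt; sizeof(*out)) {
        return 0;
    }
    /* Work on a local copy so later checks see a stable snapshot. */
    memcpy(out, buf, sizeof(*out));
    if (out-&gt;magic != IPC_MAGIC || out-&gt;version != IPC_VERSION) {
        return 0;
    }
    /* The declared payload must fit inside what was actually received. */
    return (size_t)out-&gt;payload_len &lt;= len - sizeof(*out);
}
</code></pre></div><p>Checking <code>payload_len</code> against the received length matters as much as checking <code>magic</code>: without it, a handler that trusts the header could read past the end of the RPMsg buffer.</p>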
<p>The send function should not return the raw HAL status alone; it should return errors the business layer can act on:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">enum</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">IPC_OK</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">IPC_ERR_NOT_READY</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">IPC_ERR_TIMEOUT</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">IPC_ERR_TOO_LARGE</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">IPC_ERR_BUSY</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">IPC_ERR_BAD_ARG</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="kt">ipc_result_t</span><span class="p">;</span>
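</span></span></code></pre></div><p>That way the upper layer knows whether the link is not ready, the message is too large, the ring is full, or the wait for an ACK timed out. For devices that need safety protection, the M7 must wait for the M4 to confirm a critical control command after sending it; "sent successfully" must not be assumed to mean "executed successfully".</p>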
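<p>One possible shape for the <code>ipc_send()</code> wrapper is sketched below. It is not the article's implementation: <code>ipc_link_t</code>, <code>ipc_tx_fn</code>, and the two <code>demo_tx_*</code> functions are hypothetical, and the transport is a function pointer so the logic can be exercised on a host. On target the hook would wrap <code>OPENAMP_send</code>. <code>IPC_MAX_FRAME</code> assumes a 512-byte RPMsg buffer minus a 16-byte RPMsg header; check your own OpenAMP configuration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define IPC_MAGIC     0x48495043u
#define IPC_VERSION   1u
#define IPC_MAX_FRAME 496u  /* assumption: 512-byte buffer - 16-byte RPMsg header */

typedef struct __attribute__((packed)) {
    uint32_t magic;
    uint16_t version;
    uint16_t type;
    uint32_t seq;
    uint32_t timestamp_ms;
    uint16_t payload_len;
    uint16_t flags;
} ipc_msg_header_t;

typedef enum {
    IPC_OK = 0, IPC_ERR_NOT_READY, IPC_ERR_TIMEOUT,
    IPC_ERR_TOO_LARGE, IPC_ERR_BUSY, IPC_ERR_BAD_ARG
} ipc_result_t;

/* Hypothetical transport hook: returns bytes sent, or a negative value. */
typedef int (*ipc_tx_fn)(const void *frame, size_t len);

typedef struct {
    ipc_tx_fn tx;
    int       link_ready;
    uint32_t  next_seq;
    uint32_t  tx_ok;
    uint32_t  tx_fail;
} ipc_link_t;

/* Host-side stand-ins for the real transport. */
static int demo_tx_ok(const void *frame, size_t len)   { (void)frame; return (int)len; }
static int demo_tx_full(const void *frame, size_t len) { (void)frame; (void)len; return -1; }

static ipc_result_t ipc_send(ipc_link_t *link, uint16_t type,
                             const void *payload, uint16_t payload_len,
                             uint32_t now_ms)
{
    uint8_t frame[IPC_MAX_FRAME];
    ipc_msg_header_t hdr;

    if (link == NULL || link-&gt;tx == NULL ||
        (payload == NULL &amp;&amp; payload_len != 0u)) {
        return IPC_ERR_BAD_ARG;
    }
    if (!link-&gt;link_ready) {
        return IPC_ERR_NOT_READY;
    }
    if (payload_len &gt; IPC_MAX_FRAME - sizeof(hdr)) {
        return IPC_ERR_TOO_LARGE;  /* both headers eat into the buffer */
    }

    hdr.magic = IPC_MAGIC;
    hdr.version = IPC_VERSION;
    hdr.type = type;
    hdr.seq = link-&gt;next_seq;
    hdr.timestamp_ms = now_ms;
    hdr.payload_len = payload_len;
    hdr.flags = 0u;

    memcpy(frame, &amp;hdr, sizeof(hdr));
    if (payload_len != 0u) {
        memcpy(frame + sizeof(hdr), payload, payload_len);
    }

    if (link-&gt;tx(frame, sizeof(hdr) + payload_len) &lt; 0) {
        link-&gt;tx_fail++;
        return IPC_ERR_BUSY;       /* e.g. no free buffer in the ring */
    }
    link-&gt;next_seq++;               /* consume a sequence number only on success */
    link-&gt;tx_ok++;
    return IPC_OK;
}
</code></pre></div><p>Advancing the sequence number only after a successful send keeps a local "ring full" failure from looking like packet loss on the receiving side. Waiting for an ACK, and therefore <code>IPC_ERR_TIMEOUT</code>, would sit one layer above this function.</p>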
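<h2 id="七m7-侧工程骨架主控路由和超时">7. M7-side skeleton: master control, routing, and timeouts</h2>
<p>The M7 usually acts as the master. It is responsible for booting the M4, initializing OpenAMP, creating endpoints, sending the handshake, and maintaining link state. Below is a simplified skeleton that omits the HAL initialization details:</p>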
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">static</span> <span class="k">volatile</span> <span class="kt">ipc_state_t</span> <span class="n">g_ipc_state</span> <span class="o">=</span> <span class="n">IPC_DOWN</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="k">struct</span> <span class="n">rpmsg_endpoint</span> <span class="n">g_ctrl_ept</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">uint32_t</span> <span class="n">g_seq</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">int</span> <span class="nf">ctrl_rx_cb</span><span class="p">(</span><span class="k">struct</span> <span class="n">rpmsg_endpoint</span> <span class="o">*</span><span class="n">ept</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="kt">uint32_t</span> <span class="n">src</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">priv</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">&lt;</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">ipc_msg_header_t</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">RPMSG_SUCCESS</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">const</span> <span class="kt">ipc_msg_header_t</span> <span class="o">*</span><span class="n">hdr</span> <span class="o">=</span> <span class="p">(</span><span class="k">const</span> <span class="kt">ipc_msg_header_t</span> <span class="o">*</span><span class="p">)</span><span class="n">data</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">hdr</span><span class="o">-&gt;</span><span class="n">magic</span> <span class="o">!=</span> <span class="n">IPC_MAGIC</span> <span class="o">||</span> <span class="n">hdr</span><span class="o">-&gt;</span><span class="n">version</span> <span class="o">!=</span> <span class="n">IPC_VERSION</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">RPMSG_SUCCESS</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">switch</span> <span class="p">(</span><span class="n">hdr</span><span class="o">-&gt;</span><span class="n">type</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">case</span> <span class="nl">IPC_MSG_M4_READY</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">g_ipc_state</span> <span class="o">=</span> <span class="n">IPC_READY</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="k">break</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">case</span> <span class="nl">IPC_MSG_TELEMETRY</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nf">telemetry_handle</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="k">break</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">case</span> <span class="nl">IPC_MSG_FAULT</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nf">fault_handle_from_m4</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="k">break</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">default</span><span class="o">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">break</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">RPMSG_SUCCESS</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">ipc_m7_task</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">ipc_lowlevel_init</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="nf">ipc_boot_m4</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="n">g_ipc_state</span> <span class="o">=</span> <span class="n">IPC_BOOTING</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="nf">OPENAMP_create_endpoint</span><span class="p">(</span><span class="o">&amp;</span><span class="n">g_ctrl_ept</span><span class="p">,</span> <span class="s">&#34;rpmsg-ctrl&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">           <span class="n">RPMSG_ADDR_ANY</span><span class="p">,</span> <span class="n">ctrl_rx_cb</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">)</span> <span class="o">!=</span> <span class="n">RPMSG_SUCCESS</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">OPENAMP_check_for_message</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">g_ipc_state</span> <span class="o">=</span> <span class="n">IPC_WAIT_REMOTE_READY</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">OPENAMP_check_for_message</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="nf">ipc_periodic_heartbeat</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="nf">ipc_check_timeouts</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="nf">osDelay</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
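</span></span></code></pre></div><p>In a real project, <code>OPENAMP_check_for_message()</code> can run in a dedicated task, or be driven by an interrupt that releases a semaphore for a task to handle. The key point is not to do complex protocol parsing in a high-priority interrupt. The interrupt should only issue a minimal notification; parsing belongs in task context so that real-time behavior is not affected.</p>
<p>When the M7 pushes parameters down, a two-phase "shadow parameters + commit" scheme is recommended. For example, a set of PID parameters is first written to a pending area on the M4; the M4 validates ranges, units, and mutual constraints and returns <code>PARAM_ACCEPTED</code>; finally the M7 sends <code>PARAM_COMMIT</code>. This avoids an inconsistent control state in which half of the parameters have taken effect and the other half have not.</p>
<h2 id="八m4-侧工程骨架实时任务优先通信任务次之">8. M4-side skeleton: real-time tasks first, communication second</h2>
<p>The first rule on the M4 is that communication must not disturb real-time tasks. Do not do time-consuming computation in an OpenAMP callback, do not directly modify structures the control loop is using, and certainly do not write Flash from a callback. A safer approach is to place received messages in a small queue that a low-priority task drains; the real-time control task reads only validated parameter snapshots, and only at safe points.</p>
<p>The M4 side can be organized like this:</p>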
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">static</span> <span class="k">struct</span> <span class="n">rpmsg_endpoint</span> <span class="n">g_ctrl_ept</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="k">volatile</span> <span class="kt">bool</span> <span class="n">g_link_ready</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">int</span> <span class="nf">m4_ctrl_rx_cb</span><span class="p">(</span><span class="k">struct</span> <span class="n">rpmsg_endpoint</span> <span class="o">*</span><span class="n">ept</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                         <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                         <span class="kt">uint32_t</span> <span class="n">src</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">priv</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
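</span></span><span class="line"><span class="cl">    <span class="cm">/* The push is expected to copy the payload: the RPMsg buffer is released once this callback returns. */</span>
</span></span><span class="line"><span class="cl">    <span class="nf">ipc_queue_push_from_callback</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>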
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">RPMSG_SUCCESS</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">ipc_m4_task</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">MX_OPENAMP_Init</span><span class="p">(</span><span class="n">RPMSG_REMOTE</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">OPENAMP_create_endpoint</span><span class="p">(</span><span class="o">&amp;</span><span class="n">g_ctrl_ept</span><span class="p">,</span> <span class="s">&#34;rpmsg-ctrl&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                            <span class="n">RPMSG_ADDR_ANY</span><span class="p">,</span> <span class="n">m4_ctrl_rx_cb</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">ipc_send_ready</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="n">g_link_ready</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">OPENAMP_check_for_message</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="nf">ipc_process_queued_messages</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="nf">ipc_send_telemetry_if_due</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="nf">osDelay</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>The control-loop task, in turn, does not depend on an immediate RPMsg response at all:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">motor_control_task</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
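</span></span><span class="line"><span class="cl">    <span class="cm">/* Start from the last committed set so the first control step never reads an uninitialized struct. */</span>
</span></span><span class="line"><span class="cl">    <span class="kt">control_param_t</span> <span class="n">local_param</span> <span class="o">=</span> <span class="n">g_committed_param</span><span class="p">;</span>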
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="nf">param_has_pending_commit</span><span class="p">())</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="nf">taskENTER_CRITICAL</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">            <span class="n">local_param</span> <span class="o">=</span> <span class="n">g_committed_param</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="nf">taskEXIT_CRITICAL</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="nf">adc_sample_update</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="nf">foc_control_step</span><span class="p">(</span><span class="o">&amp;</span><span class="n">local_param</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="nf">pwm_update</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">        <span class="nf">wait_next_period</span><span class="p">();</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
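</span></span></code></pre></div><p>The key to this structure is that communication brings in new information but never directly interrupts the real-time rhythm. If the M7 suddenly sends a burst of parameters or diagnostic commands, the M4 may drop low-priority messages and defer log requests, but it must not sacrifice the control period.</p>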
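<p>The "drop low-priority messages under load" policy can be sketched as a bounded queue that keeps a few slots in reserve for high-priority traffic. Everything here (<code>ipc_queue_t</code>, the slot counts, the priority flag) is a hypothetical illustration rather than the article's queue. It assumes push and pop run in the same task, as in the M4 loop above where the callback fires inside <code>OPENAMP_check_for_message()</code>; if the callback ran in another context, a critical section would be needed around both functions.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define IPC_Q_SLOTS    8u   /* assumption: small, fixed depth */
#define IPC_Q_MSG_MAX  64u  /* assumption: largest queued message */
#define IPC_Q_RESERVED 2u   /* slots only high-priority messages may use */

typedef struct {
    uint16_t len;
    uint8_t  data[IPC_Q_MSG_MAX];
} ipc_q_item_t;

typedef struct {
    ipc_q_item_t items[IPC_Q_SLOTS];
    uint32_t head;     /* next slot to write */
    uint32_t tail;     /* next slot to read */
    uint32_t count;    /* slots currently in use */
    uint32_t dropped;  /* messages rejected so far */
} ipc_queue_t;

/* Copies the message into the queue; low-priority traffic is refused
 * once only the reserved slots remain. */
static bool ipc_queue_push(ipc_queue_t *q, const void *data, size_t len,
                           bool high_prio)
{
    uint32_t limit = high_prio ? IPC_Q_SLOTS : (IPC_Q_SLOTS - IPC_Q_RESERVED);

    if (len &gt; IPC_Q_MSG_MAX || q-&gt;count &gt;= limit) {
        q-&gt;dropped++;
        return false;
    }
    ipc_q_item_t *slot = &amp;q-&gt;items[q-&gt;head];
    memcpy(slot-&gt;data, data, len);
    slot-&gt;len = (uint16_t)len;
    q-&gt;head = (q-&gt;head + 1u) % IPC_Q_SLOTS;
    q-&gt;count++;
    return true;
}

/* Removes the oldest message; returns false when the queue is empty. */
static bool ipc_queue_pop(ipc_queue_t *q, ipc_q_item_t *out)
{
    if (q-&gt;count == 0u) {
        return false;
    }
    *out = q-&gt;items[q-&gt;tail];
    q-&gt;tail = (q-&gt;tail + 1u) % IPC_Q_SLOTS;
    q-&gt;count--;
    return true;
}
</code></pre></div><p>The <code>dropped</code> counter is worth reporting in telemetry: a queue that drops silently is hard to tell apart from a link that loses messages.</p>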
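<h2 id="九cache-一致性十个双核疑难杂症七个和它有关">9. Cache coherency: seven out of ten hard dual-core bugs involve it</h2>
<p>The M7 core of the STM32H7 has a D-Cache, which is a source of performance and also a trap for dual-core communication. Typical symptoms: the M7 has clearly written a message but the M4 reads old content; the M4 writes a status but the M7 occasionally misses the update; single-stepping in the debugger works but running at full speed fails; turning the cache off makes the problem disappear but slows the system down.</p>
<p>The root cause is simple: the M7's cache and the SRAM the M4 sees are not at the same "point in time". If the shared memory is cacheable, an M7 write may land in the cache first and not yet be written back to SRAM. The M4 does not go through the M7's cache, so it cannot read the new value. Conversely, after the M4 modifies SRAM, the M7 may keep reading stale data if the old value is still in a cache line.</p>
<p>There are three strategies:</p>
<ol>
<li>Make the OpenAMP shared region non-cacheable;</li>
<li>Keep it cacheable, but strictly clean / invalidate every shared buffer;</li>
<li>Make the message region non-cacheable, and keep the bulk-data region cacheable with manual maintenance.</li>
</ol>
<p>For most small and medium projects, the first is the least trouble. RPMsg messages are usually tens to hundreds of bytes, and that small a performance gain is not worth a coherency risk. For large data such as camera images, audio blocks, or waveform buffers, design a separate shared data region with explicit ownership transfer: the writer cleans the cache after filling the buffer and then sends a notification message; the reader invalidates after receiving the message and then reads the data; when it is done, it returns the buffer via another message.</p>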
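<p>For the first strategy, a minimal sketch of the M7-side MPU setup with the STM32 HAL might look like the following. The base address, size, and region number are assumptions (a 64 KB shared region in D3 SRAM at <code>0x38000000</code>, using MPU region 1); take the real values from your linker script and resource table. <code>ipc_shared_region_mpu_init()</code> is a hypothetical name, and the function is meant to run before the D-Cache is enabled.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &#34;stm32h7xx_hal.h&#34;

/* Assumptions for illustration only: adjust to your memory map. */
#define IPC_SHM_BASE  0x38000000UL
#define IPC_SHM_SIZE  MPU_REGION_SIZE_64KB

/* Marks the shared region as Normal, non-cacheable memory for the M7. */
void ipc_shared_region_mpu_init(void)
{
    MPU_Region_InitTypeDef region = {0};

    HAL_MPU_Disable();

    region.Enable           = MPU_REGION_ENABLE;
    region.Number           = MPU_REGION_NUMBER1;
    region.BaseAddress      = IPC_SHM_BASE;
    region.Size             = IPC_SHM_SIZE;
    region.SubRegionDisable = 0x00;
    region.TypeExtField     = MPU_TEX_LEVEL1;             /* TEX=1, C=0, B=0 */
    region.AccessPermission = MPU_REGION_FULL_ACCESS;
    region.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
    region.IsShareable      = MPU_ACCESS_SHAREABLE;
    region.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE;
    region.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;
    HAL_MPU_ConfigRegion(&amp;region);

    HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
}
</code></pre></div><p>MPU regions must be size-aligned, so the shared area should start on a boundary that is a multiple of its size. The M4 has no data cache, so it needs no matching configuration for coherency purposes.</p>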
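<p>Alignment also matters. Cache maintenance operates on whole cache lines, so both the address and the length should be aligned to the cache line size (32 bytes on the Cortex-M7); otherwise adjacent data may be affected. Align shared structures to 32 bytes:</p>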
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#define CACHE_LINE_SIZE 32U
</span></span></span><span class="line"><span class="cl"><span class="cp">#define ALIGN_CACHE __attribute__((aligned(CACHE_LINE_SIZE)))
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">struct</span> <span class="n">ALIGN_CACHE</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint8_t</span> <span class="n">data</span><span class="p">[</span><span class="mi">512</span><span class="p">];</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="kt">ipc_aligned_buffer_t</span><span class="p">;</span>
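</span></span></code></pre></div><p>If someone on the team says "just add a delay and it works", suspect a cache or synchronization problem first. A delay only hides the race window; it does not fix the root cause.</p>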
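<p>To make the alignment rule concrete, the sketch below computes the cache-line-aligned range that a maintenance call would actually touch for a given buffer, and reports whether the buffer owns that range exclusively. <code>cache_range_cover()</code> and <code>cache_range_is_exclusive()</code> are hypothetical helpers, written as plain arithmetic so they can be checked on a host.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define CACHE_LINE_SIZE 32u  /* Cortex-M7 D-Cache line size */

typedef struct {
    uintptr_t start;  /* rounded down to a line boundary */
    size_t    size;   /* multiple of CACHE_LINE_SIZE */
} cache_range_t;

/* Returns the smallest line-aligned range that covers [addr, addr + len). */
static cache_range_t cache_range_cover(uintptr_t addr, size_t len)
{
    const uintptr_t mask = (uintptr_t)CACHE_LINE_SIZE - 1u;
    cache_range_t r;
    uintptr_t end = (addr + len + mask) &amp; ~mask;

    r.start = addr &amp; ~mask;
    r.size = (size_t)(end - r.start);
    return r;
}

/* Returns 1 when the buffer starts and ends on line boundaries, i.e.
 * cleaning or invalidating it cannot touch a neighbour's data. */
static int cache_range_is_exclusive(uintptr_t addr, size_t len)
{
    cache_range_t r = cache_range_cover(addr, len);
    return r.start == addr &amp;&amp; r.size == len;
}
</code></pre></div><p>A 40-byte buffer at offset 0x10 inside a line, for example, covers two full lines (64 bytes). Invalidating that range would also throw away any dirty data in the other 24 bytes, which is exactly the "neighbouring variable mysteriously corrupted" symptom. On target, the covered range is what would be passed to <code>SCB_CleanDCache_by_Addr()</code> or <code>SCB_InvalidateDCache_by_Addr()</code>.</p>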
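<h2 id="十协议设计把能收发升级成能维护">10. Protocol design: from "it sends and receives" to "it can be maintained"</h2>
<p>Early in a project, getting the two sides to print hello to each other feels like an achievement. What a production project really needs is a maintainable protocol. At minimum, consider these fields: message type, version, length, sequence number, timestamp, error code, and an optional checksum.</p>
<p>A parameter write flow can be defined as follows:</p>
<ol>
<li>The M7 sends <code>PARAM_SET_REQ</code> with the parameter ID, length, and value;</li>
<li>The M4 checks that the parameter ID exists, the length matches, and the value is in range;</li>
<li>The M4 returns <code>PARAM_SET_RSP</code> carrying <code>OK</code> or a specific error code;</li>
<li>Once every parameter has been set successfully, the M7 sends <code>PARAM_COMMIT_REQ</code>;</li>
<li>The M4 switches parameters at a safe point in the control loop and returns <code>PARAM_COMMIT_RSP</code>.</li>
</ol>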
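<p>The M4 side of this flow can be sketched as a shadow copy plus a pending flag. This is an illustration under assumptions, not the article's code: the three PID fields, the parameter IDs, the 0 to 100 value range, and the function names are all hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

typedef struct { float kp; float ki; float kd; } control_param_t;

typedef enum {
    PARAM_OK = 0,
    PARAM_ERR_UNKNOWN_ID,
    PARAM_ERR_OUT_OF_RANGE,
    PARAM_ERR_NOTHING_STAGED
} param_result_t;

enum { PARAM_ID_KP = 1, PARAM_ID_KI = 2, PARAM_ID_KD = 3 };

static control_param_t g_committed_param = { 1.0f, 0.0f, 0.0f };
static control_param_t g_shadow_param;
static bool            g_shadow_dirty;
static volatile bool   g_commit_pending;

/* Handles PARAM_SET_REQ: validates and writes into the shadow copy only. */
static param_result_t param_stage(uint16_t id, float value)
{
    float *field;

    if (!g_shadow_dirty) {
        g_shadow_param = g_committed_param;  /* start from the live values */
    }
    switch (id) {
    case PARAM_ID_KP: field = &amp;g_shadow_param.kp; break;
    case PARAM_ID_KI: field = &amp;g_shadow_param.ki; break;
    case PARAM_ID_KD: field = &amp;g_shadow_param.kd; break;
    default:          return PARAM_ERR_UNKNOWN_ID;
    }
    if (!(value &gt;= 0.0f &amp;&amp; value &lt;= 100.0f)) {  /* also rejects NaN */
        return PARAM_ERR_OUT_OF_RANGE;
    }
    *field = value;
    g_shadow_dirty = true;
    return PARAM_OK;
}

/* Handles PARAM_COMMIT_REQ: only records the request. */
static param_result_t param_request_commit(void)
{
    if (!g_shadow_dirty) {
        return PARAM_ERR_NOTHING_STAGED;
    }
    g_commit_pending = true;
    return PARAM_OK;
}

/* Called by the control loop at its safe point; swaps the whole set at once. */
static bool param_apply_if_pending(void)
{
    if (!g_commit_pending) {
        return false;
    }
    g_committed_param = g_shadow_param;
    g_shadow_dirty = false;
    g_commit_pending = false;
    return true;
}
</code></pre></div><p>Because the live set changes only inside <code>param_apply_if_pending()</code>, a rejected value or an aborted sequence never leaves the control loop running on a half-updated parameter set.</p>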
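<p>Do not settle for a single <code>FAIL</code> error code. Distinguish at least: unknown command, incompatible version, parameter not found, length error, value out of range, not allowed in the current state, internally busy, and execution timeout. Only then do field logs have diagnostic value.</p>
<p>Heartbeats matter as well. The M7 can send a heartbeat every 100 ms or 500 ms, and the M4 replies with its run counter, worst-case control-loop time, fault state, and message queue depth. If the M7 misses several consecutive heartbeats, it should enter a degraded mode: refuse new dangerous commands, prompt the user to check the secondary core, and try restarting the M4 if necessary.</p>
<p>An example heartbeat payload:</p>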
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">struct</span> <span class="nf">__attribute__</span><span class="p">((</span><span class="n">packed</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">uptime_ms</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">control_loop_count</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint16_t</span> <span class="n">max_loop_us</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint16_t</span> <span class="n">queue_depth</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">fault_bitmap</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="kt">ipc_heartbeat_payload_t</span><span class="p">;</span>
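</span></span></code></pre></div><p>These fields are also useful during development. An occasional spike in <code>max_loop_us</code> means the M4 real-time task was preempted or a critical section is too long; a steadily rising <code>queue_depth</code> means the M7 is sending too fast or the M4 cannot keep up; <code>fault_bitmap</code> lets the M7 UI display low-level protection state directly.</p>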
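<p>On the M7 side, the "several missed heartbeats means degraded mode" rule can be sketched as a small monitor. The 100 ms period comes from the text above; the miss limit of three and all names (<code>hb_monitor_t</code>, <code>hb_poll()</code>, and so on) are assumptions for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

#define HB_PERIOD_MS  100u
#define HB_MISS_LIMIT 3u   /* assumption: tolerate three silent periods */

typedef enum { LINK_OK = 0, LINK_DEGRADED } link_health_t;

typedef struct {
    uint32_t      last_rx_ms;  /* tick of the last heartbeat reply */
    link_health_t health;
} hb_monitor_t;

/* Call when a heartbeat reply arrives from the M4. */
static void hb_on_reply(hb_monitor_t *m, uint32_t now_ms)
{
    m-&gt;last_rx_ms = now_ms;
    m-&gt;health = LINK_OK;
}

/* Call periodically; degrades the link after too long a silence. */
static link_health_t hb_poll(hb_monitor_t *m, uint32_t now_ms)
{
    /* Unsigned subtraction stays correct across tick wrap-around. */
    uint32_t silent_ms = now_ms - m-&gt;last_rx_ms;

    if (silent_ms &gt; HB_PERIOD_MS * HB_MISS_LIMIT) {
        m-&gt;health = LINK_DEGRADED;
    }
    return m-&gt;health;
}

/* Dangerous commands are refused while the link is degraded. */
static bool hb_command_allowed(const hb_monitor_t *m, bool dangerous)
{
    return !(dangerous &amp;&amp; m-&gt;health != LINK_OK);
}
</code></pre></div><p>Gating commands through <code>hb_command_allowed()</code> keeps the degradation policy in one place instead of scattering link checks through the business code.</p>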
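<h2 id="十一调试方法不要只盯串口打印">11. Debugging: do not rely on serial prints alone</h2>
<p>The worst thing in a dual-core system is "both cores are printing, but you cannot tell the order". Serial printing can block, and it also changes timing. Build lightweight event tracing in from the start: each core keeps a ring trace buffer that records event ID, timestamp, and arguments, and the M7 pulls and exports it when needed.</p>
<p>Events can include:</p>
<ul>
<li>M7 releases the M4;</li>
<li>endpoint created;</li>
<li>remote READY received;</li>
<li>message send failed;</li>
<li>ring buffer full;</li>
<li>heartbeat timeout;</li>
<li>parameter commit succeeded;</li>
<li>M4 control period overrun.</li>
</ul>
<p>Example structure:</p>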
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">tick</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint16_t</span> <span class="n">event</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint16_t</span> <span class="n">arg0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">arg1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="kt">trace_item_t</span><span class="p">;</span>
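</span></span></code></pre></div><p>For debugging it also helps to have a "loopback command": the M7 sends data of a given length, the M4 returns it unchanged, and the M7 records round-trip latency and failure count. The command is simple but very effective for quickly judging whether the link, cache, ring depth, and task scheduling are healthy.</p>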
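<p>A per-core trace ring built on the <code>trace_item_t</code> record shown earlier might look like the sketch below. The depth of 64 and the names <code>trace_ring_t</code>, <code>trace_put()</code>, and <code>trace_get()</code> are assumptions. The ring overwrites the oldest record when full, and it is written from a single context; tracing from interrupts as well would need a short critical section around <code>trace_put()</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stdint.h&gt;

#define TRACE_DEPTH 64u  /* power of two so the index can be masked */

typedef struct {
    uint32_t tick;
    uint16_t event;
    uint16_t arg0;
    uint32_t arg1;
} trace_item_t;

typedef struct {
    trace_item_t items[TRACE_DEPTH];
    uint32_t     write_count;  /* total records ever written */
} trace_ring_t;

/* Appends one record, overwriting the oldest when the ring is full. */
static void trace_put(trace_ring_t *t, uint32_t tick, uint16_t event,
                      uint16_t arg0, uint32_t arg1)
{
    trace_item_t *slot = &amp;t-&gt;items[t-&gt;write_count &amp; (TRACE_DEPTH - 1u)];

    slot-&gt;tick = tick;
    slot-&gt;event = event;
    slot-&gt;arg0 = arg0;
    slot-&gt;arg1 = arg1;
    t-&gt;write_count++;
}

/* Copies the i-th oldest surviving record; returns 0 if i is out of range. */
static int trace_get(const trace_ring_t *t, uint32_t i, trace_item_t *out)
{
    uint32_t kept = (t-&gt;write_count &lt; TRACE_DEPTH) ? t-&gt;write_count
                                                     : TRACE_DEPTH;
    if (i &gt;= kept) {
        return 0;
    }
    *out = t-&gt;items[(t-&gt;write_count - kept + i) &amp; (TRACE_DEPTH - 1u)];
    return 1;
}
</code></pre></div><p>Writing a record is a handful of stores, so it disturbs timing far less than a blocking serial print, and the M7 can export both rings after the fact to reconstruct event order from the tick values.</p>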
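<p>If communication occasionally hangs, check in this order:</p>
<ol>
<li>Are both cores still running, and are the heartbeat counters increasing?</li>
<li>Is the HSEM or notification interrupt firing?</li>
<li>Are vring descriptors being consumed but not released?</li>
<li>Is the shared memory being overwritten by another module?</li>
<li>Do the cache attributes match expectations?</li>
<li>Is a blocking operation being executed in a callback?</li>
<li>Are both cores writing the same structure at the same time?</li>
</ol>
<p>When debugging both cores at once with J-Link or ST-LINK, remember that breakpoints change timing. While the M7 is halted, the M4 may keep running and the heartbeat times out; while the M4 is halted, the M7 may keep retrying until the ring fills up. In debug builds, relax the timeouts or add a <code>debug_freeze</code> switch.</p>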
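<h2 id="十二常见坑与解决方案">12. Common pitfalls and fixes</h2>
<p><strong>1. The M7 boots, but the M4 does nothing.</strong> First check the M4 boot address, the option bytes, the project's vector table, and the M7 code that releases the M4. Confirm that the M4 image was actually programmed to the correct address. Often the problem is not OpenAMP at all; the M4 simply never started.</p>
<p><strong>2. The endpoint is created, but no messages arrive.</strong> Check that the HSEM / interrupt callback is hooked into OpenAMP, that <code>OPENAMP_check_for_message()</code> is called periodically, and that the remote endpoint name matches. Also confirm that both cores use the same OpenAMP configuration and resource table address.</p>
<p><strong>3. Small messages work, large ones fail.</strong> Check the RPMsg buffer size, the maximum payload length, and the header overhead. Do not assume that a 512-byte buffer can carry 512 bytes of business data; the RPMsg header and the custom header must be subtracted first.</p>
<p><strong>4. Debug works, Release fails.</strong> Suspect optimization-exposed races, missing <code>volatile</code>, missing memory barriers, and cache coherency first. Flags shared between the cores should be <code>volatile</code> and paired with memory barriers or a hardware semaphore (HSEM); <code>volatile</code> alone does not order accesses, and RTOS primitives typically synchronize only tasks on the same core. Data shared across cores needs explicit ownership.</p>
<p><strong>5. The ring fills up after running for a while.</strong> The receiver may not be polling in time, or an error path may fail to release a buffer. Add counters for send failures, receive callbacks, and queue-full events; do not just print an error once.</p>
<p><strong>6. M4 real-time jitter increases.</strong> Check whether the OpenAMP processing task has too high a priority, whether callbacks do time-consuming work, and whether the M7 sends diagnostic requests too often. The communication task should serve the real-time task, not suppress it.</p>
<h2 id="十三一个推荐的工程目录结构">13. A recommended project directory layout</h2>
<p>For a dual-core project to stay maintainable, the directory layout needs boundaries too. One possible organization:</p>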
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">project/
</span></span><span class="line"><span class="cl">  CM7/
</span></span><span class="line"><span class="cl">    Core/
</span></span><span class="line"><span class="cl">    App/
</span></span><span class="line"><span class="cl">      ipc_master.c
</span></span><span class="line"><span class="cl">      ipc_protocol.c
</span></span><span class="line"><span class="cl">      telemetry_view.c
</span></span><span class="line"><span class="cl">  CM4/
</span></span><span class="line"><span class="cl">    Core/
</span></span><span class="line"><span class="cl">    App/
</span></span><span class="line"><span class="cl">      ipc_remote.c
</span></span><span class="line"><span class="cl">      motor_control.c
</span></span><span class="line"><span class="cl">      fault_manager.c
</span></span><span class="line"><span class="cl">  Shared/
</span></span><span class="line"><span class="cl">    ipc_config.h
</span></span><span class="line"><span class="cl">    ipc_protocol.h
</span></span><span class="line"><span class="cl">    ipc_trace.h
</span></span><span class="line"><span class="cl">    memory_map.h
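</span></span></code></pre></div><p><code>Shared</code> contains only pure headers or lightweight logic that both cores can compile; do not put code that depends on one core's peripherals there. <code>ipc_protocol.h</code> defines message types, structures, and error codes so that both sides agree on the protocol. Every protocol change should bump the version number and keep any necessary compatibility handling.</p>
<p>If the project uses CI, add a small script that checks that the M7 and M4 reference the same protocol header, or even print protocol structure sizes for static verification. Embedded projects do not necessarily need elaborate automation, but for a critical boundary like the dual-core protocol, automated checks are well worth it.</p>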
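<p>One lightweight form of that static verification is a set of C11 compile-time assertions in the shared header, so that either core's build fails as soon as a structure drifts. The sketch below repeats the two structures from this article so it stands alone; the expected sizes follow from their packed field lists.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

typedef struct __attribute__((packed)) {
    uint32_t magic;
    uint16_t version;
    uint16_t type;
    uint32_t seq;
    uint32_t timestamp_ms;
    uint16_t payload_len;
    uint16_t flags;
} ipc_msg_header_t;

typedef struct __attribute__((packed)) {
    uint32_t uptime_ms;
    uint32_t control_loop_count;
    uint16_t max_loop_us;
    uint16_t queue_depth;
    uint32_t fault_bitmap;
} ipc_heartbeat_payload_t;

/* Fail the build on either core if the wire layout changes. */
_Static_assert(sizeof(ipc_msg_header_t) == 20,
               &#34;ipc_msg_header_t changed: bump IPC_VERSION&#34;);
_Static_assert(offsetof(ipc_msg_header_t, payload_len) == 16,
               &#34;payload_len moved: bump IPC_VERSION&#34;);
_Static_assert(sizeof(ipc_heartbeat_payload_t) == 16,
               &#34;ipc_heartbeat_payload_t changed: bump IPC_VERSION&#34;);
</code></pre></div><p>Because both projects include the same header, a layout change that someone forgets to version is caught at compile time rather than as a corrupted message in the field.</p>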
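<h2 id="十四什么时候不该用-openamp">14. When should you not use OpenAMP?</h2>
<p>OpenAMP is not a silver bullet. If your inter-core communication is just a few status bits at a low rate and the team is unfamiliar with OpenAMP, HSEM plus a shared structure is fine, provided the structure has clear ownership, a version, and a synchronization mechanism. Conversely, if the project needs multiple message types, request/response, remote state management, or may later migrate to Linux + M4 or a more complex SoC, the OpenAMP / RPMsg abstraction is more valuable.</p>
<p>One more case calls for caution: very high-frequency small packets. RPMsg is general-purpose, and that generality has a cost. If you try to send dozens of small packets over RPMsg in every control period, you may drag the system into pointless scheduling and copying. For high-rate data, use a shared ring buffer or double buffering, and use RPMsg only to send a "new data block ready" notification.</p>
<h2 id="总结双核系统的稳定性来自清晰契约">Summary: dual-core stability comes from a clear contract</h2>
<p>The core of STM32H7 dual-core development is not lighting up both the M7 and the M4, nor getting the OpenAMP example to run; it is establishing a clear, verifiable, debuggable cross-core contract. The contract covers at least: boot order, shared memory addresses, cache attributes, endpoint naming, message header format, error codes, timeout policy, the parameter commit flow, and the fault degradation plan.</p>
<p>A few engineering rules are worth remembering. First, the M7 handles complex business logic and the M4 handles deterministic tasks. Second, be conservative rather than vague about shared memory. Third, cache coherency is either sidestepped through the MPU or maintained through strict encapsulation. Fourth, there must be a protocol on top of RPMsg; it should not be treated as a pipe for ad hoc byte arrays. Fifth, debug information must be structured and cannot rely on serial prints alone.</p>
<p>Designed along these lines, the STM32H7 dual-core architecture works very well. The M7 can confidently carry the UI, networking, files, and complex algorithms, while the M4 keeps the control loop and sampling tasks on a steady rhythm. The two exchange versioned, validated messages over OpenAMP / RPMsg, which extends what the system can do without sacrificing real-time behavior or maintainability.</p>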
]]></content:encoded>
    </item>
  </channel>
</rss>
