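<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>性能优化 on Tech Snippets - 嵌入式技术笔记</title><link>https://tech-snippets.xyz/tags/%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/</link><description>Recent content in 性能优化 on Tech Snippets - 嵌入式技术笔记</description><generator>Hugo</generator><language>zh-cn</language><lastBuildDate>Wed, 06 May 2026 19:00:00 +0800</lastBuildDate><atom:link href="https://tech-snippets.xyz/tags/%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/index.xml" rel="self" type="application/rss+xml"/><item><title>RISC-V Vector Extension (RVV): Principles and Practical Optimization Guide</title><link>https://tech-snippets.xyz/posts/riscv-rvv-principle-optimization-guide/</link><pubDate>Wed, 06 May 2026 19:00:00 +0800</pubDate><guid>https://tech-snippets.xyz/posts/riscv-rvv-principle-optimization-guide/</guid><description>Preface In the 2020s, demand for AI compute has grown explosively. From large language model inference to real-time computer vision to data-heavy scientific computing, the need for data-parallel processing has never been more pressing. Traditional scalar CPUs are general-purpose but struggle with massive repetitive computation; GPUs offer strong parallelism, but their power consumption and latency make them hard to deploy widely in embedded and on-device scenarios.
Against this backdrop, the RISC-V Vector Extension (RVV) emerged. As an official standard extension of the RISC-V instruction set architecture, RVV provides a flexible, scalable vector processing mechanism that can deliver efficient data-parallel computation, often within a much smaller power and latency budget than a discrete GPU. From low-power IoT devices to high-performance server CPUs, RVV is becoming one of the most transformative technologies in the RISC-V ecosystem.
RVV's design philosophy differs fundamentally from fixed-width SIMD extensions such as x86 SSE/AVX and Arm NEON (Arm SVE, like RVV, is vector-length agnostic). Rather than simply exposing a fixed-width vector register file, it introduces runtime-configurable vector length, vector register grouping, and mask operations, so that the same RVV code can run efficiently on different hardware implementations: &amp;quot;write once, accelerate everywhere&amp;quot;.
Starting from first principles, this article walks through the design of the RVV 1.0 specification and uses complete code examples to teach RVV programming and optimization techniques step by step. Whether you are a chip architect, a systems engineer, or a developer optimizing algorithms on RISC-V platforms, it aims to give you a coherent body of knowledge and a practical guide.
1. Why Do We Need a Vector Extension? Before digging into the details of RVV, let us answer a basic question: why does a CPU need a vector extension?
1.1 The Nature of Data-Level Parallelism Most compute-intensive kernels in modern workloads share one trait: they apply the same operation to large amounts of data. For example:
Image convolution: the same multiply-add applied to every pixel Matrix multiplication: large numbers of element-wise multiply-accumulate operations Neural network inference: batched operations between tensors Signal processing: FFTs, filtering, and other time-domain and frequency-domain transforms This &amp;quot;single instruction, multiple data&amp;quot; pattern is exactly where vector computing shines. With scalar instructions, every data element costs its own fetch, decode, and execute, which adds up to substantial instruction and control overhead. A single vector instruction can process tens or even hundreds of elements, which can cut the dynamic instruction count by an order of magnitude or more.
1.2 Limitations of Traditional SIMD Before RVV, every mainstream CPU architecture had its own SIMD extension:
x86: SSE, AVX, AVX2, AVX-512, with vector width growing from 128 bits to 512 bits ARM: NEON (fixed 128-bit width), SVE (scalable vectors) MIPS: MSA (128-bit vectors) These SIMD extensions perform well in specific scenarios, but the fixed-width ones share several problems:</description><content:encoded><![CDATA[<h2 id="前言">Preface</h2>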
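<p>In the 2020s, demand for AI compute has grown explosively. From large language model inference to real-time computer vision to data-heavy scientific computing, the need for <strong>data-parallel processing</strong> has never been more pressing. Traditional scalar CPUs are general-purpose but struggle with massive repetitive computation; GPUs offer strong parallelism, but their power consumption and latency make them hard to deploy widely in embedded and on-device scenarios.</p>
<p>Against this backdrop, the <strong>RISC-V Vector Extension (RVV)</strong> emerged. As an official standard extension of the RISC-V instruction set architecture, RVV provides a flexible, scalable vector processing mechanism that can deliver efficient data-parallel computation, often within a much smaller power and latency budget than a discrete GPU. From low-power IoT devices to high-performance server CPUs, RVV is becoming one of the most transformative technologies in the RISC-V ecosystem.</p>
<p>RVV's design philosophy differs fundamentally from fixed-width SIMD extensions such as x86 SSE/AVX and Arm NEON (Arm SVE, like RVV, is vector-length agnostic). Rather than simply exposing a fixed-width vector register file, it introduces <strong>runtime-configurable vector length</strong>, <strong>vector register grouping</strong>, and <strong>mask operations</strong>, so that the same RVV code can run efficiently on different hardware implementations: &quot;write once, accelerate everywhere&quot;.</p>
<p>Starting from first principles, this article walks through the design of the RVV 1.0 specification and uses complete code examples to teach RVV programming and optimization techniques step by step. Whether you are a chip architect, a systems engineer, or a developer optimizing algorithms on RISC-V platforms, it aims to give you a coherent body of knowledge and a practical guide.</p>
<p><img alt="RVV architecture overview" loading="lazy" src="/images/rvv-architecture-overview.svg"></p>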
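<h2 id="一为什么我们需要向量扩展">1. Why Do We Need a Vector Extension?</h2>
<p>Before digging into the details of RVV, let us answer a basic question: <strong>why does a CPU need a vector extension?</strong></p>
<h3 id="11-数据级并行的本质">1.1 The Nature of Data-Level Parallelism</h3>
<p>Most compute-intensive kernels in modern workloads share one trait: they <strong>apply the same operation to large amounts of data</strong>. For example:</p>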
<ul>
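<li>Image convolution: the same multiply-add applied to every pixel</li>
<li>Matrix multiplication: large numbers of element-wise multiply-accumulate operations</li>
<li>Neural network inference: batched operations between tensors</li>
<li>Signal processing: FFTs, filtering, and other time-domain and frequency-domain transforms</li>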
</ul>
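<p>This &quot;single instruction, multiple data&quot; pattern is exactly where vector computing shines. With scalar instructions, every data element costs its own fetch, decode, and execute, which adds up to substantial instruction and control overhead. A single vector instruction can process tens or even hundreds of elements, which can cut the dynamic instruction count by an order of magnitude or more.</p>
<h3 id="12-传统-simd-的局限性">1.2 Limitations of Traditional SIMD</h3>
<p>Before RVV, every mainstream CPU architecture had its own SIMD extension:</p>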
<ul>
<li><strong>x86</strong>: SSE, AVX, AVX2, AVX-512, with vector width growing from 128 bits to 512 bits</li>
<li><strong>ARM</strong>: NEON (fixed 128-bit width), SVE (scalable vectors)</li>
<li><strong>MIPS</strong>: MSA (128-bit vectors)</li>
</ul>
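<p>These SIMD extensions perform well in specific scenarios, but the fixed-width ones share several problems:</p>
<p><strong>Coupling to a fixed vector width</strong>: In traditional SIMD the vector width is defined by the hardware, and software must be written for that specific width. When the hardware is upgraded (for example, from AVX2 to AVX-512), software has to be rewritten, or at least rebuilt against the new instructions, to exploit the wider vectors.</p>
<p><strong>Underused register resources</strong>: For narrow data types (such as 8-bit and 16-bit), a fixed-width vector register can hold more elements, but the instruction set often lacks flexible type conversion and operation support for them.</p>
<p><strong>Poor code portability</strong>: SIMD instruction sets are often incompatible across architectures and even across generations of the same architecture. To get the best performance on multiple platforms, developers have to maintain several versions of their SIMD code.</p>
<h3 id="13-rvv-的设计突破">1.3 RVV's Design Breakthroughs</h3>
<p>RVV's designers recognized these problems and introduced a set of design ideas to address them:</p>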
<ul>
<li>
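<p><strong>Runtime-configurable vector length</strong>: Software does not hard-code the hardware register width (VLEN). Instead, it passes the number of elements it still needs to process to a <code>vsetvl</code>-family instruction, and the hardware returns the vector length (<code>vl</code>) it will actually use. The same code therefore runs correctly on hardware with different VLEN.</p>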
</li>
<li>
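<p><strong>Flexible data type support</strong>: The element width (SEW) is configured at run time. Integer elements can be 8, 16, 32, or 64 bits wide; floating-point elements are 32 or 64 bits, plus 16 bits where the optional half-precision vector extension is implemented.</p>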
</li>
<li>
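<p><strong>Vector register grouping (LMUL)</strong>: The LMUL parameter combines several physical registers into one logical register group, so a single instruction can operate on more elements when needed.</p>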
</li>
<li>
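<p><strong>Built-in mask support</strong>: Most vector instructions can execute under a mask and process only selected elements, which avoids the separate blend and select sequences that older SIMD instruction sets often require.</p>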
</li>
</ul>
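<p>Together, these choices aim to give RVV <strong>high performance</strong>, <strong>flexibility</strong>, and <strong>portability</strong>, and they reflect where vector processing architectures are heading.</p>
<h2 id="二rvv-10-规范的设计哲学">2. The Design Philosophy of the RVV 1.0 Specification</h2>
<p>The RVV 1.0 specification was frozen in 2021, marking the point at which the RISC-V vector extension became stable enough to build on. Understanding its design philosophy is the first step toward mastering RVV programming.</p>
<h3 id="21-向量长度的解耦vlen-vs-vl">2.1 Decoupling Vector Length: VLEN vs vl</h3>
<p>RVV's central innovation is to decouple the <strong>vector length implemented by the hardware</strong> from the <strong>vector length used by software</strong>.</p>
<p><strong>VLEN (vector register length)</strong>: The number of bits in each vector register. It is a hardware parameter and can differ from chip to chip. The specification requires VLEN to be a power of two no greater than 65,536; the full V extension additionally requires VLEN ≥ 128, while the embedded Zve* subsets allow smaller values. Common VLEN configurations include 128, 256, 512, and 1024 bits.</p>
<p><strong>vl (the vector length register)</strong>: The number of elements the current vector operation processes, set by software at run time. Each vector instruction operates only on the first <code>vl</code> elements; the remaining tail elements are either left undisturbed or may be overwritten, depending on the tail policy (<code>vta</code>).</p>
<p>This design provides a great deal of flexibility:</p>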
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Register width in bits: vlenb holds VLEN / 8
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">size_t</span> <span class="n">vlen</span> <span class="o">=</span> <span class="nf">__riscv_vlenb</span><span class="p">()</span> <span class="o">*</span> <span class="mi">8</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Request 16 elements (SEW = 32, LMUL = 1); the return value is the vl actually granted
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">size_t</span> <span class="n">vl</span> <span class="o">=</span> <span class="nf">__riscv_vsetvl_e32m1</span><span class="p">(</span><span class="mi">16</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Processes only the first vl elements (va and vb are vint32m1_t values loaded earlier)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="n">vc</span> <span class="o">=</span> <span class="nf">__riscv_vadd_vv_i32m1</span><span class="p">(</span><span class="n">va</span><span class="p">,</span> <span class="n">vb</span><span class="p">,</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p>When the requested length exceeds what the hardware supports for the current configuration, the hardware caps <code>vl</code> at that maximum (VLMAX). The same code may therefore process 4 32-bit elements per iteration on a VLEN=128 chip and 16 on a VLEN=512 chip, with no change to the program logic.</p>
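<p>The capping behavior can be illustrated without RVV hardware. The following is a minimal scalar sketch, not real intrinsic code: <code>model_vsetvl</code> and <code>model_vector_add</code> are hypothetical helper names, and the model assumes the simplest legal policy of granting <code>min(AVL, VLMAX)</code> (real hardware is allowed to grant fewer elements in some cases).</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of vsetvl: grant at most vlmax elements.
 * Assumption: the implementation always returns min(avl, vlmax). */
size_t model_vsetvl(size_t avl, size_t vlmax) {
    assert(vlmax > 0);
    return avl < vlmax ? avl : vlmax;
}

/* Strip-mined c[i] = a[i] + b[i] for SEW = 32, LMUL = 1.
 * Returns the number of loop iterations, i.e. modeled vector instructions. */
size_t model_vector_add(const int32_t *a, const int32_t *b, int32_t *c,
                        size_t n, size_t vlen_bits) {
    size_t vlmax = vlen_bits / 32;
    size_t iterations = 0;
    for (size_t i = 0; i < n; ) {
        size_t vl = model_vsetvl(n - i, vlmax);
        for (size_t j = 0; j < vl; j++) {
            /* Stands in for one vadd.vv; unsigned math wraps like the hardware. */
            c[i + j] = (int32_t)((uint32_t)a[i + j] + (uint32_t)b[i + j]);
        }
        i += vl;
        iterations++;
    }
    return iterations;
}
```

<p>With <code>n = 10</code>, this model needs 3 iterations at VLEN = 128 (4 + 4 + 2 elements) and a single iteration at VLEN = 512, while the source code stays identical.</p>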
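<h3 id="22-向量类型配置vtype-寄存器">2.2 Configuring the Vector Type: the vtype Register</h3>
<p>In addition to the vector length <code>vl</code>, RVV controls the behavior of vector operations through the <code>vtype</code> vector type register. <code>vtype</code> contains the following key fields:</p>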
<ul>
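<li><strong>SEW (Selected Element Width)</strong>: the size of each vector element: 8, 16, 32, or 64 bits</li>
<li><strong>LMUL (Vector Register Group Multiplier)</strong>: how many physical registers make up each logical vector</li>
<li><strong>vill</strong>: the illegal-configuration flag, set by hardware when the requested <code>vtype</code> is not supported (the <code>vediv</code> field from earlier drafts is not part of RVV 1.0)</li>
<li><strong>vta (Vector Tail Agnostic)</strong>: the tail policy, which specifies whether elements past <code>vl</code> may be overwritten by the hardware</li>
<li><strong>vma (Vector Mask Agnostic)</strong>: the mask policy, which specifies whether masked-off elements may be overwritten by the hardware</li>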
</ul>
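<p>These parameters are set with <code>vsetvli</code>, <code>vsetivli</code>, or <code>vsetvl</code> and stay in effect until they are changed, so a new configuration instruction is needed only when <code>vl</code> or <code>vtype</code> must change. This may look like an extra instruction, but keeping the element width and grouping out of each opcode gives compilers and hardware considerable room for optimization.</p>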
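<p>As an illustration of how these fields fit together, here is a small sketch that packs the low eight bits of <code>vtype</code> using the RVV 1.0 layout (<code>vlmul</code> in bits 2:0, <code>vsew</code> in bits 5:3, <code>vta</code> in bit 6, <code>vma</code> in bit 7). The function name <code>encode_vtype</code> and the <code>LMUL_*</code> constants are hypothetical names chosen for this example, and the <code>vill</code> bit is not modeled.</p>

```c
#include <assert.h>
#include <stdint.h>

/* vlmul encodings: 0..3 mean LMUL = 1, 2, 4, 8; 5, 6, 7 mean 1/8, 1/4, 1/2.
 * Encoding 4 is reserved. Constant names are illustrative only. */
enum {
    LMUL_1 = 0, LMUL_2 = 1, LMUL_4 = 2, LMUL_8 = 3,
    LMUL_F8 = 5, LMUL_F4 = 6, LMUL_F2 = 7
};

/* Pack vlmul[2:0], vsew[5:3], vta[6], vma[7] into one value. */
uint32_t encode_vtype(unsigned sew_bits, unsigned vlmul, int vta, int vma) {
    unsigned vsew = 0;
    switch (sew_bits) {
    case 8:  vsew = 0; break;
    case 16: vsew = 1; break;
    case 32: vsew = 2; break;
    case 64: vsew = 3; break;
    default: assert(!"unsupported SEW"); break;
    }
    assert(vlmul <= 7 && vlmul != 4);
    return (uint32_t)vlmul | ((uint32_t)vsew << 3) |
           ((vta ? 1u : 0u) << 6) | ((vma ? 1u : 0u) << 7);
}
```

<p>Under this layout, <code>e32, m1, ta, ma</code> encodes to <code>0xD0</code> and <code>e8, m8, ta, ma</code> to <code>0xC3</code>.</p>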
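<h3 id="23-向量寄存器堆v0---v31">2.3 The Vector Register File: v0 - v31</h3>
<p>RVV defines 32 vector registers, <code>v0</code> through <code>v31</code>, each VLEN bits wide. They are logically independent, but the LMUL parameter can combine them into larger logical register groups.</p>
<p>Note that <code>v0</code> has a special role: it is the mask register. In RVV 1.0 a masked instruction always reads its mask from <code>v0</code>; the instruction encoding has no way to name a different mask register, so a mask produced in another register must be moved into <code>v0</code> before it is used.</p>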
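<h2 id="三向量寄存器分组lmul机制详解">3. Vector Register Grouping (LMUL) in Detail</h2>
<p>Vector register grouping (LMUL) is one of RVV's most ingenious designs and also one of the easiest to misunderstand. Understanding how LMUL works is key to RVV performance tuning.</p>
<p><img alt="RVV vector register grouping" loading="lazy" src="/images/rvv-register-grouping.svg"></p>
<h3 id="31-lmul-的基本概念">3.1 LMUL Basics</h3>
<p>LMUL (vector register group multiplier) determines how many physical vector registers each logical vector uses. RVV 1.0 supports the following LMUL values:</p>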
<ul>
<li><strong>Fractional LMUL</strong>: 1/8, 1/4, 1/2: a logical vector uses only part of one physical register</li>
<li><strong>Integer LMUL</strong>: 1, 2, 4, 8: a logical vector spans multiple physical registers</li>
</ul>
<p>Why is it designed this way? The answer lies in two competing needs of vector code:</p>
<ol>
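<li><strong>More elements per instruction</strong>: when processing large amounts of same-typed data, we want each vector instruction to handle as many elements as possible to amortize instruction overhead.</li>
<li><strong>More temporaries</strong>: when implementing complex algorithms, we need more vector registers for intermediate results to avoid frequent register spills.</li>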
</ol>
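<p>The LMUL mechanism lets software trade one of these needs against the other.</p>
<h3 id="32-vlmax-的计算">3.2 Computing VLMAX</h3>
<p>For a given SEW and LMUL, the maximum number of elements the hardware supports, VLMAX, is:</p>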
<pre tabindex="0"><code>VLMAX = (VLEN × LMUL) / SEW
</code></pre><p>Let us work through a few examples (assuming VLEN = 256 bits):</p>
<table>
<thead>
<tr>
<th>LMUL</th>
<th>SEW</th>
<th>VLMAX calculation</th>
<th>Result</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>32 bits</td>
<td>(256 × 1) / 32</td>
<td>8 elements</td>
<td>Each logical vector uses 1 physical register</td>
</tr>
<tr>
<td>4</td>
<td>32 bits</td>
<td>(256 × 4) / 32</td>
<td>32 elements</td>
<td>Each logical vector uses 4 physical registers</td>
</tr>
<tr>
<td>1/2</td>
<td>32 bits</td>
<td>(256 × 1/2) / 32</td>
<td>4 elements</td>
<td>Each logical vector uses half of 1 physical register; the other half is unused</td>
</tr>
</tbody>
</table>
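<p>As the table shows, a larger LMUL increases the number of elements per vector operation at the cost of fewer usable logical vector registers. With LMUL = 8, for example, the 32 physical registers provide only 4 logical vector registers (v0, v8, v16, v24).</p>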
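<p>The formula and the register-count trade-off can be checked with a few lines of plain C. This is a sketch for experimentation only; the type <code>lmul_t</code> and the functions <code>compute_vlmax</code> and <code>register_groups</code> are names invented for this example, and LMUL is represented as a fraction so that fractional settings work with integer arithmetic.</p>

```c
#include <assert.h>
#include <stddef.h>

/* LMUL as a fraction num/den, e.g. {4, 1} for LMUL = 4 or {1, 2} for LMUL = 1/2. */
typedef struct {
    unsigned num;
    unsigned den;
} lmul_t;

/* VLMAX = (VLEN * LMUL) / SEW */
size_t compute_vlmax(size_t vlen_bits, size_t sew_bits, lmul_t lmul) {
    assert(sew_bits > 0 && lmul.num > 0 && lmul.den > 0);
    return vlen_bits * lmul.num / lmul.den / sew_bits;
}

/* Usable register groups: 32 / LMUL for integer LMUL.
 * Fractional LMUL does not add registers, so the count stays at 32. */
unsigned register_groups(lmul_t lmul) {
    assert(lmul.num > 0 && lmul.den > 0);
    return lmul.num > lmul.den ? 32u * lmul.den / lmul.num : 32u;
}
```

<p>For VLEN = 256 and SEW = 32, <code>compute_vlmax</code> gives 8, 32, and 4 for LMUL = 1, 4, and 1/2, matching the table, and <code>register_groups</code> gives 4 for LMUL = 8.</p>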
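<h3 id="33-lmul-的选择策略">3.3 Choosing LMUL</h3>
<p>How should you choose LMUL in practice? A few rules of thumb:</p>
<p><strong>Start with LMUL = 1</strong>: This is the most general configuration. It gives a reasonable vector length while keeping all 32 vector registers available to the algorithm.</p>
<p><strong>Use a larger LMUL for simple compute-intensive kernels</strong>: If the algorithm is simple (vector addition or a dot product, for example) and needs few temporaries, LMUL = 2 or LMUL = 4 increases the number of elements per instruction and can raise throughput.</p>
<p><strong>Use LMUL = 1 for complex algorithms</strong>: If the algorithm has many intermediate results or long dependency chains, keeping more vector registers available matters more than a longer vector. Fractional LMUL does not add registers (there are still only 32 register names); its main use is mixed-width code, where narrow operands use a fractional LMUL so that their widened results fit in LMUL = 1 and the element count stays the same.</p>
<p><strong>Adjust for the data type</strong>: For narrow data types (such as 8-bit), LMUL = 1 may already provide enough elements (32 at VLEN=256), so a larger LMUL is unnecessary.</p>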
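<h2 id="四rvv-指令集分类详解">4. RVV Instruction Categories in Detail</h2>
<p>The RVV 1.0 specification defines more than 150 vector instructions covering every aspect of vector computation. By function, they fall into the following categories.</p>
<h3 id="41-向量配置指令">4.1 Vector Configuration Instructions</h3>
<p>Vector configuration instructions set the <code>vl</code> and <code>vtype</code> registers and are the &quot;opening line&quot; of any vector code.</p>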
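<p><strong>vsetvli (Set Vector Length, immediate vtype)</strong>:</p>
<pre tabindex="0"><code class="language-assembly" data-lang="assembly">vsetvli rd, rs1, vtypei
</code></pre><p>This instruction takes the requested length (AVL) from <code>rs1</code> and the vector type encoded in the immediate <code>vtypei</code>, computes the vector length that will actually be used, and writes it to both <code>rd</code> and the <code>vl</code> register. The register form, <code>vsetvl rd, rs1, rs2</code>, reads the vector type from <code>rs2</code> instead.</p>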
<p>With C intrinsics, the corresponding call can be wrapped as:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">size_t</span> <span class="nf">vsetvl_e32m1</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">avl</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="nf">__riscv_vsetvl_e32m1</span><span class="p">(</span><span class="n">avl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><strong>vsetivli (Set Vector Length, immediate AVL and vtype)</strong>:</p>
<pre tabindex="0"><code class="language-assembly" data-lang="assembly">vsetivli rd, uimm, vtypei
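</code></pre><p>This variant encodes the requested length as a 5-bit immediate (0 to 31), which suits short lengths known at compile time.</p>
<h3 id="42-向量-loadstore-指令">4.2 Vector Load/Store Instructions</h3>
<p>RVV's memory instructions are a major source of its flexibility and support several access patterns:</p>
<p><strong>Unit-stride access</strong>:</p>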
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Load a vector from consecutive memory addresses
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vle32_v_i32m1</span><span class="p">(</span><span class="k">const</span> <span class="kt">int32_t</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Store a vector to consecutive memory addresses
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">void</span> <span class="nf">__riscv_vse32_v_i32m1</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">vint32m1_t</span> <span class="n">value</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p>This is the most common access pattern, equivalent to the contiguous loads and stores of traditional SIMD.</p>
<p><strong>Strided access</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Load one element every stride bytes
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vlse32_v_i32m1</span><span class="p">(</span><span class="k">const</span> <span class="kt">int32_t</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">stride</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Store one element every stride bytes
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">void</span> <span class="nf">__riscv_vsse32_v_i32m1</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">stride</span><span class="p">,</span> <span class="kt">vint32m1_t</span> <span class="n">value</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p>Strided access suits column accesses in a row-major matrix and regularly spaced elements of an array.</p>
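<p>To make the addressing rule concrete, here is a scalar sketch of what a strided 32-bit load does: element <code>i</code> is read from byte address <code>base + i * stride</code>. The function <code>model_vlse32</code> is a hypothetical stand-in for illustration, not an intrinsic.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a strided load: dst[i] = *(int32_t *)((char *)base + i * stride_bytes).
 * memcpy is used so the byte-level addressing stays well defined. */
void model_vlse32(int32_t *dst, const int32_t *base, ptrdiff_t stride_bytes, size_t vl) {
    assert(dst != NULL && base != NULL);
    const unsigned char *p = (const unsigned char *)base;
    for (size_t i = 0; i < vl; i++) {
        memcpy(&dst[i], p + (ptrdiff_t)i * stride_bytes, sizeof dst[i]);
    }
}
```

<p>For a row-major matrix with 4 columns of <code>int32_t</code>, reading one column means starting at that column's first element and using a stride of <code>4 * sizeof(int32_t)</code> bytes. Note that the stride is in bytes, not elements.</p>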
<p><strong>Indexed access</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Load elements at the byte offsets held in an index vector
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vluxei32_v_i32m1</span><span class="p">(</span><span class="k">const</span> <span class="kt">int32_t</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">vuint32m1_t</span> <span class="n">bindex</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Store elements at the byte offsets held in an index vector
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">void</span> <span class="nf">__riscv_vsuxei32_v_i32m1</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">vuint32m1_t</span> <span class="n">bindex</span><span class="p">,</span> <span class="kt">vint32m1_t</span> <span class="n">value</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p>Indexed access performs gather/scatter on arbitrary addresses, something many earlier SIMD instruction sets either lack or added only in later generations.</p>
<p><strong>Segment access</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Load array-of-structures (AoS) data: three interleaved int32 fields into a 3-register tuple
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1x3_t</span> <span class="nf">__riscv_vlseg3e32_v_i32m1x3</span><span class="p">(</span><span class="k">const</span> <span class="kt">int32_t</span> <span class="o">*</span><span class="n">base</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p>A segment load fetches several fields at once and deinterleaves them into separate registers, which fits the AoS access pattern and avoids a separate shuffle step. In the v1.0 intrinsics the result is a tuple type whose fields are extracted with <code>__riscv_vget_v_i32m1x3_i32m1</code>, so a toolchain that implements tuple types is required; the strided variant of the instruction is <code>vlsseg3e32</code>.</p>
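<p>The deinterleaving performed by a three-field segment load can be sketched in scalar C. <code>model_vlseg3e32</code> below is a hypothetical model written for this explanation: field <code>k</code> of structure <code>i</code> ends up in element <code>i</code> of output vector <code>k</code>.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 3-field unit-stride segment load on int32 data.
 * base points at vl packed structures of three int32 fields each. */
void model_vlseg3e32(int32_t *f0, int32_t *f1, int32_t *f2,
                     const int32_t *base, size_t vl) {
    assert(f0 != NULL && f1 != NULL && f2 != NULL && base != NULL);
    for (size_t i = 0; i < vl; i++) {
        f0[i] = base[3 * i + 0];  /* first field of structure i  */
        f1[i] = base[3 * i + 1];  /* second field of structure i */
        f2[i] = base[3 * i + 2];  /* third field of structure i  */
    }
}
```

<p>Applied to interleaved data such as <code>x0 y0 z0 x1 y1 z1 ...</code>, the three outputs hold all <code>x</code>, all <code>y</code>, and all <code>z</code> values respectively, which is the layout subsequent vector arithmetic wants.</p>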
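<h3 id="43-向量算术运算指令">4.3 Vector Arithmetic Instructions</h3>
<p>RVV provides a full set of arithmetic instructions for both integers and floating-point numbers:</p>
<p><strong>Vector-vector operations</strong>:</p>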
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Vector addition
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vadd_vv_i32m1</span><span class="p">(</span><span class="kt">vint32m1_t</span> <span class="n">op1</span><span class="p">,</span> <span class="kt">vint32m1_t</span> <span class="n">op2</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Vector multiplication
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vmul_vv_i32m1</span><span class="p">(</span><span class="kt">vint32m1_t</span> <span class="n">op1</span><span class="p">,</span> <span class="kt">vint32m1_t</span> <span class="n">op2</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Vector multiply-accumulate: acc + op1 * op2
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vmacc_vv_i32m1</span><span class="p">(</span><span class="kt">vint32m1_t</span> <span class="n">acc</span><span class="p">,</span> <span class="kt">vint32m1_t</span> <span class="n">op1</span><span class="p">,</span> <span class="kt">vint32m1_t</span> <span class="n">op2</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p><strong>Vector-scalar operations</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Add a scalar to every element
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vadd_vx_i32m1</span><span class="p">(</span><span class="kt">vint32m1_t</span> <span class="n">op1</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">op2</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Multiply every element by a scalar
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vmul_vx_i32m1</span><span class="p">(</span><span class="kt">vint32m1_t</span> <span class="n">op1</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">op2</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p><strong>Saturating arithmetic</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Saturating add (vsadd): the result is clamped to the int8 range instead of wrapping
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint8m1_t</span> <span class="nf">__riscv_vsadd_vv_i8m1</span><span class="p">(</span><span class="kt">vint8m1_t</span> <span class="n">op1</span><span class="p">,</span> <span class="kt">vint8m1_t</span> <span class="n">op2</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p>Saturating arithmetic is useful in image and signal processing, where wraparound on overflow would otherwise produce visible artifacts.</p>
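<p>The difference between saturating and wrapping addition is easiest to see in scalar form. The sketch below models a signed 8-bit saturating add element by element; <code>model_vsadd_i8</code> is a name made up for this example.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a signed saturating add on int8 elements:
 * results outside [INT8_MIN, INT8_MAX] are clamped rather than wrapped. */
void model_vsadd_i8(int8_t *dst, const int8_t *a, const int8_t *b, size_t vl) {
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < vl; i++) {
        int sum = (int)a[i] + (int)b[i];  /* widen so the sum cannot overflow */
        if (sum > INT8_MAX) sum = INT8_MAX;
        if (sum < INT8_MIN) sum = INT8_MIN;
        dst[i] = (int8_t)sum;
    }
}
```

<p>For example, 100 + 100 yields 127 and -100 + -100 yields -128, whereas a wrapping 8-bit add would give -56 and 56.</p>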
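<h3 id="44-向量比较与掩码指令">4.4 Vector Comparison and Mask Instructions</h3>
<p>Most RVV instructions can execute under a mask, which is one of the extension's key features.</p>
<p><strong>Vector comparison</strong>:</p>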
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Element-wise equality comparison, producing a mask
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vbool32_t</span> <span class="nf">__riscv_vmseq_vv_i32m1_b32</span><span class="p">(</span><span class="kt">vint32m1_t</span> <span class="n">op1</span><span class="p">,</span> <span class="kt">vint32m1_t</span> <span class="n">op2</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Element-wise greater-than comparison
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vbool32_t</span> <span class="nf">__riscv_vmsgt_vv_i32m1_b32</span><span class="p">(</span><span class="kt">vint32m1_t</span> <span class="n">op1</span><span class="p">,</span> <span class="kt">vint32m1_t</span> <span class="n">op2</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p>A comparison produces a mask with one bit per element. The mask type is named after the ratio SEW/LMUL, so comparing <code>vint32m1_t</code> values yields a <code>vbool32_t</code>.</p>
<p><strong>Masked operations</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Masked vector addition (mask-undisturbed variant)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vadd_vv_i32m1_mu</span><span class="p">(</span><span class="kt">vbool32_t</span> <span class="n">mask</span><span class="p">,</span> <span class="kt">vint32m1_t</span> <span class="n">maskedoff</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                                    <span class="kt">vint32m1_t</span> <span class="n">op1</span><span class="p">,</span> <span class="kt">vint32m1_t</span> <span class="n">op2</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p>Where a <code>mask</code> bit is set, the operation is performed; where it is clear, the <code>_mu</code> (mask-undisturbed) variant keeps the value from <code>maskedoff</code>. The plain <code>_m</code> variant takes no <code>maskedoff</code> argument and treats masked-off elements as mask-agnostic, following the <code>vma</code> policy.</p>
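<p>The compare-then-mask pattern can be sketched in scalar C as follows. Both functions are hypothetical models (the mask is represented as an array of <code>bool</code> rather than packed bits), and the add follows mask-undisturbed semantics.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a greater-than comparison: mask[i] = (a[i] > b[i]). */
void model_vmsgt(bool *mask, const int32_t *a, const int32_t *b, size_t vl) {
    assert(mask != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < vl; i++) {
        mask[i] = a[i] > b[i];
    }
}

/* Scalar model of a mask-undisturbed add:
 * active elements get a[i] + b[i], inactive ones keep maskedoff[i]. */
void model_vadd_mu(int32_t *dst, const bool *mask, const int32_t *maskedoff,
                   const int32_t *a, const int32_t *b, size_t vl) {
    assert(dst != NULL && mask != NULL && maskedoff != NULL);
    for (size_t i = 0; i < vl; i++) {
        dst[i] = mask[i] ? (int32_t)((uint32_t)a[i] + (uint32_t)b[i])
                         : maskedoff[i];
    }
}
```

<p>This replaces a per-element <code>if</code> in the scalar loop: the comparison runs once over the whole vector, and the masked add updates only the elements that passed it.</p>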
<h3 id="45-向量置换与重排指令">4.5 Vector Permutation Instructions</h3>
<p>Permutation instructions rearrange the elements within a vector, a key step in many algorithms.</p>
<p><strong>Slide</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Move every element up by one index and write the scalar into element 0
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vslide1up_vx_i32m1</span><span class="p">(</span><span class="kt">vint32m1_t</span> <span class="n">src</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">value</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p>Slide instructions are efficient building blocks for FIR filters, sliding-window sums, and similar algorithms.</p>
<p><strong>Gather</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Gather elements of op1 at the positions given by an index vector
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vrgather_vv_i32m1</span><span class="p">(</span><span class="kt">vint32m1_t</span> <span class="n">op1</span><span class="p">,</span> <span class="kt">vuint32m1_t</span> <span class="n">op2</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p><strong>Compress</strong>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Pack the elements selected by the mask into the low end of the result
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vcompress_vm_i32m1</span><span class="p">(</span><span class="kt">vint32m1_t</span> <span class="n">src</span><span class="p">,</span> <span class="kt">vbool32_t</span> <span class="n">mask</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p>Compress is useful for conditional filtering and similar operations.</p>
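<p>A scalar sketch shows what compress does to the element order. <code>model_vcompress</code> is a hypothetical helper; unlike the real instruction, it also returns the number of packed elements for convenience (in RVV code that count would come from a separate mask population-count instruction).</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of compress: copy the elements whose mask bit is set,
 * in order, to the low end of dst. Returns how many were copied. */
size_t model_vcompress(int32_t *dst, const int32_t *src, const bool *mask, size_t vl) {
    assert(dst != NULL && src != NULL && mask != NULL);
    size_t out = 0;
    for (size_t i = 0; i < vl; i++) {
        if (mask[i]) {
            dst[out++] = src[i];
        }
    }
    return out;
}
```

<p>Filtering the positive values out of <code>{3, -1, 4, -1, 5}</code> with a mask of <code>src[i] &gt; 0</code> packs <code>3, 4, 5</code> into the first three positions; the remaining destination elements are left as they were in this model.</p>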
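<h3 id="46-向量归约指令">4.6 Vector Reduction Instructions</h3>
<p>Reduction instructions combine all active elements of a vector into a single value, which is written to element 0 of the destination vector register.</p>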
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="c1">// Sum reduction: result[0] = scalar[0] + vector[0] + ... + vector[vl-1]
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vredsum_vs_i32m1_i32m1</span><span class="p">(</span><span class="kt">vint32m1_t</span> <span class="n">vector</span><span class="p">,</span> <span class="kt">vint32m1_t</span> <span class="n">scalar</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Max reduction: result[0] = max(scalar[0], vector[0], ..., vector[vl-1])
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">vint32m1_t</span> <span class="nf">__riscv_vredmax_vs_i32m1_i32m1</span><span class="p">(</span><span class="kt">vint32m1_t</span> <span class="n">vector</span><span class="p">,</span> <span class="kt">vint32m1_t</span> <span class="n">scalar</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">vl</span><span class="p">);</span>
</span></span></code></pre></div><p>Reductions make dot products, sums, and min/max searches efficient. The scalar result can be read out of element 0 with <code>__riscv_vmv_x_s_i32m1_i32</code>.</p>
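<p>The role of the extra <code>scalar</code> operand is sometimes confusing: it supplies the starting value of the reduction in its element 0. The scalar sketch below models this for sum and max; <code>model_vredsum</code> and <code>model_vredmax</code> are hypothetical names, and the starting value is passed as a plain <code>int32_t</code> instead of a vector.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a sum reduction: init + vector[0] + ... + vector[vl-1].
 * Unsigned arithmetic is used so overflow wraps as it does in hardware. */
int32_t model_vredsum(const int32_t *vector, int32_t init, size_t vl) {
    assert(vector != NULL);
    uint32_t acc = (uint32_t)init;
    for (size_t i = 0; i < vl; i++) {
        acc += (uint32_t)vector[i];
    }
    return (int32_t)acc;
}

/* Scalar model of a signed max reduction: max(init, vector[0], ..., vector[vl-1]). */
int32_t model_vredmax(const int32_t *vector, int32_t init, size_t vl) {
    assert(vector != NULL);
    int32_t best = init;
    for (size_t i = 0; i < vl; i++) {
        if (vector[i] > best) {
            best = vector[i];
        }
    }
    return best;
}
```

<p>In a strip-mined loop, the result of one iteration is typically fed back as the starting value of the next, so the running total carries across chunks without a separate scalar add.</p>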
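<h2 id="五rvv-编程入门从汇编到-intrinsic">5. Getting Started with RVV Programming: From Assembly to Intrinsics</h2>
<p>With the principles and instruction set covered, let us learn RVV programming through real code.</p>
<h3 id="51-环境准备">5.1 Setting Up the Environment</h3>
<p>To compile RVV code you need a toolchain that supports RVV 1.0. Recommended:</p>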
<ul>
<li><strong>GCC 13+</strong> or <strong>Clang 17+</strong>: the compiler must support <code>-march=rv64gcv</code> or <code>-march=rv32gcv</code></li>
<li><strong>QEMU 7+</strong>: emulates a RISC-V environment on an x86 host</li>
<li><strong>Spike</strong>: the official RISC-V ISA simulator</li>
</ul>
<p>Example compile command:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">riscv64-linux-gnu-gcc -march<span class="o">=</span>rv64gcv -O2 -o <span class="nb">test</span> test.c
</span></span></code></pre></div><h3 id="52-第一个-rvv-程序向量加法">5.2 A First RVV Program: Vector Addition</h3>
<p>Let us start with the simplest case, vector addition:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;riscv_vector.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp"></span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">vector_add</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="c1">// Set the vector length: the smaller of the remaining element count and the maximum available length
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="kt">size_t</span> <span class="n">vl</span> <span class="o">=</span> <span class="nf">__riscv_vsetvl_e32m1</span><span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="n">i</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Load the input vectors
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="kt">vint32m1_t</span> <span class="n">va</span> <span class="o">=</span> <span class="nf">__riscv_vle32_v_i32m1</span><span class="p">(</span><span class="o">&amp;</span><span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="kt">vint32m1_t</span> <span class="n">vb</span> <span class="o">=</span> <span class="nf">__riscv_vle32_v_i32m1</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Vector addition
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="kt">vint32m1_t</span> <span class="n">vc</span> <span class="o">=</span> <span class="nf">__riscv_vadd_vv_i32m1</span><span class="p">(</span><span class="n">va</span><span class="p">,</span> <span class="n">vb</span><span class="p">,</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Store the result
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="nf">__riscv_vse32_v_i32m1</span><span class="p">(</span><span class="o">&amp;</span><span class="n">c</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">vc</span><span class="p">,</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="n">i</span> <span class="o">+=</span> <span class="n">vl</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int32_t</span> <span class="n">a</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">10</span><span class="p">};</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int32_t</span> <span class="n">b</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">40</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="mi">70</span><span class="p">,</span> <span class="mi">80</span><span class="p">,</span> <span class="mi">90</span><span class="p">,</span> <span class="mi">100</span><span class="p">};</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int32_t</span> <span class="n">c</span><span class="p">[</span><span class="mi">10</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="nf">vector_add</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="mi">10</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">printf</span><span class="p">(</span><span class="s">&#34;%d &#34;</span><span class="p">,</span> <span class="n">c</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="nf">printf</span><span class="p">(</span><span class="s">&#34;</span><span class="se">\n</span><span class="s">&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
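</span></span></code></pre></div><p>The core of this program is a loop that processes <code>vl</code> elements per iteration. The value of <code>vl</code> is determined by the hardware at run time: with 32-bit elements and LMUL=1, <code>vl</code> is at most 4 (128 / 32) on VLEN=128 hardware and at most 16 on VLEN=512 hardware. Whatever the hardware, the code logic does not need to change.</p>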
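<p>The chunking described above can be modeled in plain C, with no vector hardware required. The sketch below is an illustration only: <code>model_vsetvl_e32m1</code> and <code>count_strips</code> are hypothetical helper names, not part of the RVV intrinsics API, and the model assumes the simplest behavior the specification permits, <code>vl = min(AVL, VLMAX)</code> with <code>VLMAX = VLEN / 32</code> for SEW=32 and LMUL=1. Real implementations may choose a different <code>vl</code> when AVL lies between VLMAX and 2×VLMAX.</p>
<pre tabindex="0"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;

/* Hypothetical model of vsetvl for SEW=32, LMUL=1: returns how many
 * elements one loop iteration would process on a machine whose vector
 * registers are vlen_bits wide. Assumes vl = min(avl, VLMAX) and
 * vlen_bits &gt;= 32. */
static size_t model_vsetvl_e32m1(size_t avl, size_t vlen_bits) {
    size_t vlmax = vlen_bits / 32;   /* elements per register at SEW=32 */
    return avl &lt; vlmax ? avl : vlmax;
}

/* Counts how many iterations the strip-mined loop needs for n elements. */
static size_t count_strips(size_t n, size_t vlen_bits) {
    size_t strips = 0;
    for (size_t i = 0; i &lt; n; ) {
        i += model_vsetvl_e32m1(n - i, vlen_bits);
        strips++;
    }
    return strips;
}
</code></pre><p>Under this model, the 10-element example takes three iterations on VLEN=128 hardware (4, 4, then 2 elements) and a single iteration on VLEN=512 hardware.</p>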
<h3 id="53-汇编代码分析">5.3 Assembly Code Analysis</h3>
<p>Let's look at the assembly code the compiler generates for the <code>vector_add</code> function (a hand-simplified version):</p>
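<pre tabindex="0"><code class="language-assembly" data-lang="assembly">vector_add:                           # a0 = a, a1 = b, a2 = c, a3 = n
    beqz    a3, .L2                   # n == 0: nothing to do
.L3:
    vsetvli a5, a3, e32, m1, ta, ma   # a5 = vl = elements handled this iteration
    vle32.v v1, (a0)                  # load a chunk of a
    vle32.v v2, (a1)                  # load a chunk of b
    vadd.vv v1, v1, v2                # vector add
    vse32.v v1, (a2)                  # store the result to c
    slli    a4, a5, 2                 # a4 = vl * 4 bytes
    add     a0, a0, a4                # advance the a pointer
    add     a1, a1, a4                # advance the b pointer
    add     a2, a2, a4                # advance the c pointer
    sub     a3, a3, a5                # remaining elements -= vl
    bnez    a3, .L3                   # loop while elements remain
.L2:
    ret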
</code></pre><p>As you can see, the core vector work takes only a handful of instructions, yet each of them processes multiple elements at once.</p>
<h3 id="54-条件操作掩码的使用">5.4 Conditional Operations: Using Masks</h3>
<p>Let's look at a more complex example: a conditional vector add that only processes the elements at even indices:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">vector_add_even</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kt">size_t</span> <span class="n">vl</span> <span class="o">=</span> <span class="nf">__riscv_vsetvl_e32m1</span><span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="n">i</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="kt">vint32m1_t</span> <span class="n">va</span> <span class="o">=</span> <span class="nf">__riscv_vle32_v_i32m1</span><span class="p">(</span><span class="o">&amp;</span><span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="kt">vint32m1_t</span> <span class="n">vb</span> <span class="o">=</span> <span class="nf">__riscv_vle32_v_i32m1</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Build a vector of global element indices
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="kt">vuint32m1_t</span> <span class="n">vid</span> <span class="o">=</span> <span class="nf">__riscv_vid_v_u32m1</span><span class="p">(</span><span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="kt">vuint32m1_t</span> <span class="n">offset</span> <span class="o">=</span> <span class="nf">__riscv_vadd_vx_u32m1</span><span class="p">(</span><span class="n">vid</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Build the mask: keep only even indices
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="kt">vuint32m1_t</span> <span class="n">mod</span> <span class="o">=</span> <span class="nf">__riscv_vand_vx_u32m1</span><span class="p">(</span><span class="n">offset</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="kt">vbool32_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="nf">__riscv_vmseq_vx_u32m1_b32</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="c1">// Masked vector add (_mu): lanes with a clear mask bit keep the value from va
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>        <span class="kt">vint32m1_t</span> <span class="n">vc</span> <span class="o">=</span> <span class="nf">__riscv_vadd_vv_i32m1_mu</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="n">va</span><span class="p">,</span> <span class="n">va</span><span class="p">,</span> <span class="n">vb</span><span class="p">,</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="nf">__riscv_vse32_v_i32m1</span><span class="p">(</span><span class="o">&amp;</span><span class="n">c</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">vc</span><span class="p">,</span> <span class="n">vl</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        
</span></span><span class="line"><span class="cl">        <span class="n">i</span> <span class="o">+=</span> <span class="n">vl</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
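</span></span></code></pre></div><p>In this example, we use the <code>vid</code> instruction to generate element indices, build a mask from them with a bitwise AND and a comparison, and finally perform the addition under the control of that mask. This shows how flexibly RVV handles conditional processing.</p>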
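<p>When testing a masked kernel like this, it helps to have a scalar reference to compare against. The following is a sketch, not part of the original example: <code>vector_add_even_ref</code> is a hypothetical name, and it assumes that lanes where the mask is clear receive the corresponding element of <code>a</code> unchanged (the behavior when <code>va</code> is passed as the pass-through operand under a mask-undisturbed policy).</p>
<pre tabindex="0"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Scalar reference for the masked add: even indices get a[i] + b[i],
 * odd indices are assumed to keep a[i] (the pass-through value).
 * Signed overflow of the sum is not handled in this sketch. */
static void vector_add_even_ref(const int32_t *a, const int32_t *b,
                                int32_t *c, size_t n) {
    for (size_t i = 0; i &lt; n; i++) {
        c[i] = (i % 2 == 0) ? a[i] + b[i] : a[i];
    }
}
</code></pre><p>Running both versions on the same inputs and comparing <code>c</code> element by element is a quick way to catch a wrong mask type or a wrong pass-through operand.</p>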
]]></content:encoded></item></channel></rss>