<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>AI 工具 on Tech Snippets - 嵌入式技术笔记</title>
    <link>https://tech-snippets.xyz/tags/ai-%E5%B7%A5%E5%85%B7/</link>
    <description>Recent content in AI 工具 on Tech Snippets - 嵌入式技术笔记</description>
    <generator>Hugo</generator>
    <language>zh-cn</language>
    <lastBuildDate>Sat, 06 Jun 2026 19:00:00 +0800</lastBuildDate>
    <atom:link href="https://tech-snippets.xyz/tags/ai-%E5%B7%A5%E5%85%B7/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>vLLM Local LLM Inference Serving in Practice: From the OpenAI API to Throughput, GPU Memory, and Latency Tuning</title>
      <link>https://tech-snippets.xyz/posts/vllm-local-llm-inference-optimization-guide/</link>
      <pubDate>Sat, 06 Jun 2026 19:00:00 +0800</pubDate>
      <guid>https://tech-snippets.xyz/posts/vllm-local-llm-inference-optimization-guide/</guid>
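      <description>An engineering-focused guide to running a local vLLM inference service, covering installation and deployment, the OpenAI-compatible API, how PagedAttention works, load-testing methods, GPU memory parameters, production operations, and common failures.</description>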
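      <content:encoded><![CDATA[<h2 id="前言为什么本地推理服务会成为团队的基础设施">Preface: Why Local Inference Services Become Team Infrastructure</h2>
<p>Over the past two years, many teams have moved large language models from “a toy that can chat a bit” into real business workflows: customer-service quality inspection, code assistants, document retrieval, knowledge-base Q&amp;A, BI analysis, R&amp;D automation, and equipment operations assistants. The scenarios keep getting more concrete and the call volume more stable. The most common tension at this stage is that a single interaction looks fine, but once many people use the service at the same time, latency, cost, rate limits, data security, and model version control all turn into engineering problems.</p>
<p>If you are only writing a script for yourself, calling a cloud API directly is the easiest route. If the team already has private data, intranet systems, steady QPS, a fixed set of models, and compliance requirements, a local inference service deserves serious investment. The goal is not to “completely replace cloud services” but to move the controllable, predictable, cacheable, and auditable portion of requests onto your own infrastructure: you decide the model version, logs stay on the intranet, you optimize GPU utilization yourself, and traffic peaks can be handled with queues and degradation policies.</p>
<p>Among such solutions, vLLM is currently a very common choice. Its strength is not merely “starting a model” but system-level optimization for online LLM inference: an OpenAI-compatible API, continuous batching, PagedAttention, tensor parallelism, streaming output, Prometheus metrics, and reasonably mature server-side parameters. For many teams, vLLM sits right between “research code” and “production service”: closer to production than a hand-written Transformers server, yet much lighter than a full platform.</p>
<p>This article does more than list a set of startup commands. It follows the order of a real rollout: how to choose the model and hardware, how to start the OpenAI-compatible service, why PagedAttention matters for throughput and GPU memory, which metrics to watch during load testing, how to tune the common parameters, and finally the gateway, monitoring, systemd, Docker Compose, and troubleshooting. After reading, you should be able to build a usable intranet inference service and know how to stabilize it next.</p>
<p><img alt="vLLM inference service architecture diagram" loading="lazy" src="/images/vllm-inference-architecture.svg"></p>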
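<h2 id="一先把目标说清楚不是跑起来而是稳定地跑">I. State the Goal First: Not “Getting It Running” but “Running It Reliably”</h2>
<p>The first step of many local LLM projects goes smoothly: download the model, install dependencies, run a demo, see a reply, and everyone is excited. The real problems usually show up in the second week: colleagues start integrating, input and output lengths vary, someone runs an 8K context, someone turns on streaming, and someone else generates summaries in bulk. GPU memory still looks plentiful, yet requests queue longer and longer; some requests wait more than ten seconds for the first token; after a model upgrade the old parameters no longer fit; CUDA OOM shows up in the logs now and then but is hard to reproduce.</p>
<p>So before setting up vLLM, pin down four goals first.</p>
<p>First, who the service is for. Is it a small number of calls from internal developers, or continuous calls from business systems? For developer use, prioritize flexibility; for a business workflow, prioritize rate limiting, monitoring, canary releases, and rollback.</p>
<p>Second, how large the model is. 7B, 14B, 32B, and 70B models place completely different demands on GPU memory and parallelism. The larger the model, the harder single-GPU deployment becomes and the sharper the trade-off between throughput and latency. Do not look at parameter count alone; also consider the quantization format, the context length, whether you need multiple LoRA adapters, and whether you also need to serve embedding or rerank models.</p>
<p>Third, what the requests look like. Short Q&amp;A, long-document summarization, code generation, and agent tool calls have very different token distributions. The prefill phase mainly processes input tokens, while the decode phase generates output one token at a time; very long inputs raise time to first token, and very long outputs hold KV Cache for longer. A load test that only sends requests like “hello” produces results with no reference value.</p>
<p>Fourth, what service level is acceptable. For example: P95 time to first token under 3 seconds, average output speed above 40 tokens per second, a busy response returned once queueing exceeds 30 seconds, and GPU memory utilization on a single GPU held at around 85%. The earlier these targets are written down, the less later tuning relies on gut feeling.</p>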
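<h2 id="二vllm-的核心价值连续批处理与-pagedattention">II. The Core Value of vLLM: Continuous Batching and PagedAttention</h2>
<p>LLM inference differs from a traditional web service. After an HTTP request arrives, the model does not finish in one shot; it goes through two phases, prefill and decode. Prefill feeds the input prompt through the model and builds the initial KV Cache; decode then generates one token per step and appends the new KV entries to the cache. In an online service, requests differ in length, arrival time, and generation length. If you wait to assemble a fixed batch, the GPU easily sits idle; if you run each request on its own, throughput is too low.</p>
<p>vLLM's continuous batching addresses the scheduling problem of “requests keep arriving and keep finishing”. Instead of assembling a batch and running it to completion, it dynamically selects the runnable sequences at every scheduling step: some requests have just entered prefill, some are decoding, and some have finished and released their resources. This keeps the GPU working more continuously and reduces the waste of waiting for a fixed batch.</p>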
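<p>To make the scheduling difference concrete, here is a toy simulation that counts decode steps under fixed batching versus slot-refilling continuous batching. It is a simplified sketch, not vLLM's scheduler: it ignores prefill, memory limits, and preemption, and the function names and request lengths are made up for illustration.</p>

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Refill a freed slot on the next step instead of waiting for the batch."""
    waiting = deque(lengths)
    running = []  # remaining decode steps of each active sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this step.
        while waiting and len(running) != batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active sequence produces one token; finished ones leave.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Decode lengths in tokens, made up for illustration.
lengths = [2, 8, 2, 8, 2, 8]
print(static_batching_steps(lengths, batch_size=2))
print(continuous_batching_steps(lengths, batch_size=2))
```

<p>With these made-up lengths the fixed batches take 24 steps and the refilling scheduler takes 18, because a short request that finishes early frees its slot instead of waiting for its long neighbor.</p>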
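<p>PagedAttention addresses GPU memory management for the KV Cache. During generation, every sequence must keep the KV data that attention needs. The traditional approach tends to reserve contiguous space for each request, which causes fragmentation and waste when long and short requests are mixed. PagedAttention borrows the paging idea from operating systems: it splits the KV Cache into blocks and manages them through a mapping from logical blocks to physical blocks. Short requests are not forced to hold an oversized contiguous region, and long requests can grow on demand. For an online service, this directly affects concurrency, memory utilization, and OOM risk.</p>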
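<p>The block-table idea can be sketched with a toy allocator. This illustrates the paging concept only and is not vLLM's implementation; the class name, block size, and pool size are assumptions chosen for the example.</p>

```python
class PagedKVAllocator:
    """Toy allocator: each sequence owns a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> physical block ids, in logical order
        self.seq_lens = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; return False if the pool is exhausted."""
        table = self.block_tables.setdefault(seq_id, [])
        used = self.seq_lens.get(seq_id, 0)
        if used == len(table) * self.block_size:  # last block is full, or none yet
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = used + 1
        return True

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):  # a 20-token sequence
    alloc.append_token("short")
for _ in range(70):  # a 70-token sequence
    alloc.append_token("long")
print(len(alloc.block_tables["short"]), len(alloc.block_tables["long"]))
alloc.release("short")
print(len(alloc.free_blocks))
```

<p>Here the 20-token sequence holds 2 blocks and the 70-token sequence holds 5, so neither reserves space for the maximum context length up front, and releasing the short sequence returns its blocks to the pool immediately.</p>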
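<p>Put simply: continuous batching determines whether the GPU stays busy, and PagedAttention determines how finely GPU memory is used. Together they make vLLM better suited to serving than a hand-written inference loop.</p>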
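<h2 id="三环境准备从一台干净-gpu-服务器开始">III. Environment Setup: Start from a Clean GPU Server</h2>
<p>The following uses Linux with NVIDIA GPUs as an example. In production, pin the driver, CUDA, Python, and vLLM versions, and do not upgrade dependencies on the fly during business peaks. The minimal preparation is: confirm the GPU is visible, create a Python environment, install vLLM, download the model, and finally start the OpenAI-compatible service.</p>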
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">nvidia-smi
</span></span><span class="line"><span class="cl">python3 --version
</span></span></code></pre></div><p>If the server hosts several Python projects, use a dedicated virtual environment:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">python3 -m venv /opt/venvs/vllm
</span></span><span class="line"><span class="cl"><span class="nb">source</span> /opt/venvs/vllm/bin/activate
</span></span><span class="line"><span class="cl">python -m pip install --upgrade pip
</span></span><span class="line"><span class="cl">pip install vllm
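</span></span></code></pre></div><p>Models can be downloaded from Hugging Face or an internal mirror. In production, keep the model at a fixed local path so that service startup does not depend on the external network:</p>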
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">mkdir -p /data/models
</span></span><span class="line"><span class="cl"><span class="c1"># Example: sync the model to /data/models/Qwen2.5-7B-Instruct in advance via huggingface-cli or an internal artifact repository</span>
</span></span></code></pre></div><p>The minimal command to start the service is:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">vllm serve /data/models/Qwen2.5-7B-Instruct <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --host 0.0.0.0 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --port <span class="m">8000</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --served-model-name qwen2.5-7b-instruct
</span></span></code></pre></div><p>After startup, test with the OpenAI SDK or curl:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">curl http://127.0.0.1:8000/v1/chat/completions <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -H <span class="s2">&#34;Content-Type: application/json&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -d <span class="s1">&#39;{
</span></span></span><span class="line"><span class="cl"><span class="s1">    &#34;model&#34;: &#34;qwen2.5-7b-instruct&#34;,
</span></span></span><span class="line"><span class="cl"><span class="s1">    &#34;messages&#34;: [
</span></span></span><span class="line"><span class="cl"><span class="s1">      {&#34;role&#34;: &#34;system&#34;, &#34;content&#34;: &#34;You are a rigorous engineering assistant.&#34;},
</span></span></span><span class="line"><span class="cl"><span class="s1">      {&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;Explain the advantages of vLLM in three sentences.&#34;}
</span></span></span><span class="line"><span class="cl"><span class="s1">    ],
</span></span></span><span class="line"><span class="cl"><span class="s1">    &#34;temperature&#34;: 0.3,
</span></span></span><span class="line"><span class="cl"><span class="s1">    &#34;max_tokens&#34;: 256,
</span></span></span><span class="line"><span class="cl"><span class="s1">    &#34;stream&#34;: false
</span></span></span><span class="line"><span class="cl"><span class="s1">  }&#39;</span>
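</span></span></code></pre></div><p>If this request returns, the service path works end to end. But this step only proves “usable”, not “ready to go live”. Before launch, three more things are needed: load testing, parameter tuning, and service governance.</p>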
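<h2 id="四openai-兼容接口让业务少改代码">IV. The OpenAI-Compatible API: Fewer Code Changes for the Business Side</h2>
<p>A practical advantage of vLLM is its OpenAI-compatible API. Many business systems already wrap their call layer around <code>/v1/chat/completions</code>, <code>/v1/completions</code>, or the embedding endpoint; as long as the local service speaks a similar protocol, switching is cheap. Usually the business side only needs to change <code>base_url</code>, <code>api_key</code>, and the <code>model</code> name.</p>
<p>Python example:</p>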
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">base_url</span><span class="o">=</span><span class="s2">&#34;http://127.0.0.1:8000/v1&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">api_key</span><span class="o">=</span><span class="s2">&#34;local-dev-key&#34;</span><span class="p">,</span>  <span class="c1"># Without gateway auth in front, vLLM itself may not validate this value</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">resp</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="o">=</span><span class="s2">&#34;qwen2.5-7b-instruct&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="s2">&#34;You are an engineering assistant who knows Linux and inference optimization.&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="s2">&#34;Give me a load-testing checklist for a vLLM service.&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="n">temperature</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">max_tokens</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
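</span></span></code></pre></div><p>Streaming output matters too. For a user interface, total generation time may be 20 seconds, but if the first token appears within 2 seconds, the experience feels noticeably better. Streaming example:</p>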
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">stream</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="o">=</span><span class="s2">&#34;qwen2.5-7b-instruct&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="s2">&#34;Write an example of managing vLLM with systemd.&#34;</span><span class="p">}],</span>
</span></span><span class="line"><span class="cl">    <span class="n">temperature</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">max_tokens</span><span class="o">=</span><span class="mi">800</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">stream</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">stream</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">delta</span> <span class="o">=</span> <span class="n">chunk</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">delta</span><span class="o">.</span><span class="n">content</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">delta</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s2">&#34;&#34;</span><span class="p">,</span> <span class="n">flush</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
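</span></span></code></pre></div><p>One piece of engineering experience: using streaming on the UI side does not mean the backend can ignore total time. Streaming only improves perceived latency; GPU resources are still held by long outputs. If the business allows it, set a different <code>max_tokens</code> for each entry point rather than letting every request default to generating 4096 tokens.</p>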
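<p>One way to apply per-entry output caps is a small lookup in the calling layer before the request is sent. The sketch below is an assumption about how such a policy might look; the entry names, cap values, and the <code>clamp_max_tokens</code> helper are hypothetical and not part of vLLM or the OpenAI SDK.</p>

```python
# Hypothetical per-entry output caps; names and numbers are illustrative only.
MAX_TOKENS_BY_ENTRY = {
    "chat_ui": 1024,
    "customer_service": 256,
    "batch_summary": 2048,
}
DEFAULT_MAX_TOKENS = 512


def clamp_max_tokens(entry, requested=None):
    """Return the max_tokens to send upstream, never above the entry's cap."""
    cap = MAX_TOKENS_BY_ENTRY.get(entry, DEFAULT_MAX_TOKENS)
    if requested is None:
        return cap
    # Keep the value within [1, cap] even if the caller asks for more.
    return max(1, min(requested, cap))


print(clamp_max_tokens("customer_service", 4096))
print(clamp_max_tokens("chat_ui"))
```

<p>The clamped value would then be passed as <code>max_tokens</code> in the request, so a caller asking for 4096 tokens on the customer-service entry is held to that entry's cap.</p>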
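<h2 id="五关键启动参数先理解再调大">V. Key Startup Parameters: Understand First, Then Turn Them Up</h2>
<p>vLLM has many parameters, but you do not need to touch them all at first. Start with the following groups.</p>
<h3 id="1-上下文长度">1. Context Length</h3>
<p><code>--max-model-len</code> sets the maximum context length the service accepts. The longer the context, the more KV Cache each request can occupy and the fewer requests can run concurrently. Many people like to start with 32K or 64K, yet in real workloads 90% of requests may be under 4K. Unless you truly need long-document processing, start with a conservative length and extend it only when load testing shows the need.</p>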
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">--max-model-len <span class="m">8192</span>
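</span></span></code></pre></div><h3 id="2-显存水位">2. GPU Memory Watermark</h3>
<p><code>--gpu-memory-utilization</code> controls the fraction of GPU memory that vLLM plans to use. The default is usually safe (0.9 in recent releases); in a single-machine, single-service setup you can state it explicitly or raise it slightly, for example 0.90 or 0.92. Do not blindly push it to 0.98, because the driver, the CUDA context, temporary tensors, and monitoring processes also consume GPU memory, and too high a watermark makes OOM errors appear at random.</p>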
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">--gpu-memory-utilization 0.90
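</span></span></code></pre></div><h3 id="3-并发序列与批处理-token">3. Concurrent Sequences and Batched Tokens</h3>
<p><code>--max-num-seqs</code> caps the number of sequences processed at the same time, and <code>--max-num-batched-tokens</code> caps the number of tokens in one scheduling batch. Short-request, high-concurrency workloads can raise the sequence count; long-input workloads are affected more by batched tokens. Neither is “the bigger the better”: values that are too large can raise time to first token and even increase memory pressure.</p>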
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">--max-num-seqs <span class="m">64</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>--max-num-batched-tokens <span class="m">8192</span>
</span></span></code></pre></div><h3 id="4-并行方式">4. Parallelism</h3>
<p>When a model does not fit on a single GPU, use tensor parallelism:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">--tensor-parallel-size <span class="m">2</span>
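</span></span></code></pre></div><p>Tensor parallelism splits the model across multiple GPUs. It solves the memory problem but adds cross-GPU communication overhead. For 7B and 14B models that fit on one GPU, parallelism is not necessarily needed; 32B and 70B models usually require multiple GPUs. Do not equate “multi-GPU” with “faster”: actual speed depends on model size, interconnect bandwidth, batch shape, and scheduling parameters.</p>
<h3 id="5-量化与-dtype">5. Quantization and dtype</h3>
<p>If GPU memory is tight, consider a quantized model. Quantization reduces memory usage and makes deployment more feasible, but it may affect output quality and the performance of some operators. In production, fix an evaluation set and compare FP16/BF16, AWQ, GPTQ, and other formats on quality, throughput, and latency, rather than only checking whether the model loads.</p>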
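<p>A rough estimate helps when weighing context length and dtype against concurrency. The sketch below uses the standard per-token KV Cache size for attention with grouped KV heads (two tensors, K and V, per layer). The layer count, head sizes, and 20 GiB budget are illustrative assumptions, not the configuration of any particular model or GPU, and block-level rounding is ignored.</p>

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """K and V each store num_kv_heads * head_dim values per layer per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_full_length_seqs(kv_budget_bytes, max_model_len, per_token_bytes):
    """How many sequences fit if every one grows to max_model_len."""
    return kv_budget_bytes // (max_model_len * per_token_bytes)


# Illustrative numbers, not the configuration of any specific model.
per_token = kv_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2  # 2 bytes = FP16/BF16
)
budget = 20 * 1024**3  # assume 20 GiB remains for KV Cache after loading weights

print(per_token // 1024, "KiB per token")
for max_len in (4096, 8192, 32768):
    print(max_len, max_full_length_seqs(budget, max_len, per_token))
```

<p>Under these assumptions each token costs 128 KiB, so a 20 GiB budget holds 40 full-length sequences at 4K context, 20 at 8K, and only 5 at 32K. Real workloads rarely fill every sequence to the maximum, which is why measured concurrency is usually higher than this worst case.</p>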
<p>A fairly conservative startup command:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">vllm serve /data/models/Qwen2.5-14B-Instruct <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --host 0.0.0.0 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --port <span class="m">8000</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --served-model-name qwen2.5-14b-instruct <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --max-model-len <span class="m">8192</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --gpu-memory-utilization 0.90 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --max-num-seqs <span class="m">64</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --max-num-batched-tokens <span class="m">8192</span>
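</span></span></code></pre></div><h2 id="六压测方法不要只看-qps">VI. Load Testing: Do Not Look Only at QPS</h2>
<p>The most common mistake in LLM load testing is looking only at QPS. A traditional API request may just query a database and assemble JSON, so QPS is intuitive. The cost of LLM inference depends on input tokens, output tokens, concurrency, and sampling parameters. Two requests both count as “one call”, but one with 50 input and 50 output tokens and another with 6000 input and 2000 output tokens put entirely different orders of magnitude of load on the GPU.</p>
<p>Record at least the following metrics:</p>
<ul>
<li>TTFT (Time To First Token): latency to the first token, which shapes perceived responsiveness;</li>
<li>TPOT (Time Per Output Token): average time per output token;</li>
<li>End-to-End Latency: total time for the complete request;</li>
<li>Output Throughput: output tokens per second;</li>
<li>Total Token Throughput: total processing capacity for input plus output tokens;</li>
<li>Queue Time: time a request waits in the server-side queue;</li>
<li>GPU Utilization: GPU compute utilization;</li>
<li>GPU Memory: memory usage and peak;</li>
<li>Error Rate: share of timeouts, cancellations, OOMs, and rate-limited requests.</li>
</ul>
<p>The load-test dataset should be as close to real traffic as possible. Prepare three groups of prompts: short Q&amp;A, typical knowledge-base Q&amp;A, and long-document summarization. Fix the input length and target output length for each group, then measure concurrency levels of 1, 4, 8, 16, 32, and 64. Also test streaming and non-streaming separately, because the business layer's timeout policies may differ.</p>
<p>The idea behind a load-test script:</p>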
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">asyncio</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">time</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">AsyncOpenAI</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">AsyncOpenAI</span><span class="p">(</span><span class="n">base_url</span><span class="o">=</span><span class="s2">&#34;http://127.0.0.1:8000/v1&#34;</span><span class="p">,</span> <span class="n">api_key</span><span class="o">=</span><span class="s2">&#34;local&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">PROMPT</span> <span class="o">=</span> <span class="s2">&#34;Explain vLLM continuous batching from an engineering-practice angle and give tuning advice. &#34;</span> <span class="o">*</span> <span class="mi">20</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">async</span> <span class="k">def</span> <span class="nf">one_request</span><span class="p">(</span><span class="n">i</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">perf_counter</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">first</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">    <span class="n">out</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">    <span class="n">stream</span> <span class="o">=</span> <span class="k">await</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">completions</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">model</span><span class="o">=</span><span class="s2">&#34;qwen2.5-7b-instruct&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">PROMPT</span><span class="p">}],</span>
</span></span><span class="line"><span class="cl">        <span class="n">max_tokens</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">temperature</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">stream</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">async</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">stream</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">delta</span> <span class="o">=</span> <span class="n">chunk</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">delta</span><span class="o">.</span><span class="n">content</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">delta</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">first</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">first</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">perf_counter</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">            <span class="n">out</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">delta</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">t1</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">perf_counter</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;id&#34;</span><span class="p">:</span> <span class="n">i</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;ttft&#34;</span><span class="p">:</span> <span class="kc">None</span> <span class="k">if</span> <span class="n">first</span> <span class="ow">is</span> <span class="kc">None</span> <span class="k">else</span> <span class="n">first</span> <span class="o">-</span> <span class="n">t0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;latency&#34;</span><span class="p">:</span> <span class="n">t1</span> <span class="o">-</span> <span class="n">t0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;chars&#34;</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="s2">&#34;&#34;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">out</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">(</span><span class="n">concurrency</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span><span class="n">one_request</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">concurrency</span><span class="p">)]</span>
</span></span><span class="line"><span class="cl">    <span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">,</span> <span class="n">return_exceptions</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">ok</span> <span class="o">=</span> <span class="p">[</span><span class="n">r</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results</span> <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="nb">dict</span><span class="p">)]</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="n">ok</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">main</span><span class="p">(</span><span class="mi">16</span><span class="p">))</span>
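</span></span></code></pre></div><p>This script is not a complete load-testing tool, but it illustrates the idea: record time to first token, total latency, and output length. For formal load tests, use a more complete benchmark script, or write the results to CSV and compute P50, P90, P95, and P99 in Python.</p>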
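<p>As a minimal sketch of that last step, the percentiles can be computed with the nearest-rank method over result dicts shaped like the ones <code>one_request</code> returns. The helper names and the sample data below are made up for illustration; other percentile definitions (for example, with interpolation) give slightly different values on small samples.</p>

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(results, key):
    """Collect one metric from result dicts, skipping entries where it is None."""
    samples = [r[key] for r in results if r.get(key) is not None]
    return {f"p{p}": percentile(samples, p) for p in (50, 90, 95, 99)}


# Made-up results with the same keys as the load-test script's return value.
results = [{"id": i, "ttft": 0.1 * (i + 1), "latency": 1.0 + i} for i in range(10)]
print(summarize(results, "latency"))
```

<p>With only ten samples, P95 and P99 both land on the slowest request, which is a reminder that tail percentiles need far more requests than the concurrency level alone to be meaningful.</p>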
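<h2 id="七调参顺序先场景后参数">VII. Tuning Order: Scenario First, Parameters Second</h2>
<p>My advice is to tune in the order below rather than changing each parameter as you come across it.</p>
<p>Step 1: fix the model, dtype, and context length. Whenever the model changes, every result must be re-measured; whenever the context length changes, the KV Cache budget changes too. Lock down the baseline conditions first.</p>
<p>Step 2: build a baseline with real prompts. Start at a concurrency of 1 and step up to the target, recording TTFT, throughput, GPU memory, and error rate. This baseline is essential; every later change is compared against it.</p>
<p>Step 3: adjust the memory watermark. Observe how things change as <code>--gpu-memory-utilization</code> goes from 0.85 to 0.90 and 0.92. If concurrency capacity improves clearly without OOM, keep the change; if it only makes errors random, roll back.</p>
<p>Step 4: adjust <code>max-num-seqs</code>. Short-request, multi-user workloads usually benefit from higher sequence concurrency; long-request workloads need care about queue growth and time to first token.</p>
<p>Step 5: adjust <code>max-num-batched-tokens</code>. This parameter affects prefill batching capacity. Long-input summarization, knowledge-base Q&amp;A, and code analysis may benefit from raising it moderately; but if requests are mostly short outputs, raising it a lot may not help noticeably.</p>
<p>Step 6: set business-side limits, including maximum input length, maximum output length, timeouts, per-user concurrency limits, and per-task queue length. Many OOMs are not caused by wrong vLLM parameters but by a business layer that allows “unlimited input + unlimited output + unlimited concurrency”.</p>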
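<p>The business-side limits in the last step can be enforced before a request ever reaches the inference server. The following is a minimal sketch under assumed limits; <code>RequestLimits</code>, <code>AdmissionController</code>, and all numbers are hypothetical, and input length is measured in characters only to keep the example free of a tokenizer dependency.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestLimits:
    # Hypothetical limits; derive real ones from load-test results.
    max_input_chars: int = 24000
    max_output_tokens: int = 1024
    max_inflight_per_user: int = 4


class AdmissionController:
    """Reject oversized or excess requests before they reach the inference server."""

    def __init__(self, limits):
        self.limits = limits
        self.inflight = {}  # user_id -> number of requests currently running

    def try_admit(self, user_id, prompt, max_tokens):
        """Return (admitted, reason); on success the caller must call finish()."""
        if len(prompt) > self.limits.max_input_chars:
            return False, "input too long"
        if max_tokens > self.limits.max_output_tokens:
            return False, "max_tokens too large"
        if self.inflight.get(user_id, 0) >= self.limits.max_inflight_per_user:
            return False, "too many concurrent requests"
        self.inflight[user_id] = self.inflight.get(user_id, 0) + 1
        return True, "ok"

    def finish(self, user_id):
        """Release one in-flight slot for the user."""
        self.inflight[user_id] = max(0, self.inflight.get(user_id, 0) - 1)


ctl = AdmissionController(RequestLimits(max_inflight_per_user=1))
print(ctl.try_admit("alice", "hello", 256))
print(ctl.try_admit("alice", "hello again", 256))
ctl.finish("alice")
print(ctl.try_admit("alice", "x" * 30000, 256))
```

<p>In a real service the in-flight counter would live in shared storage such as the gateway or a cache, since this in-memory dict only works within a single process.</p>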
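<h2 id="八网关鉴权与限流不要把-vllm-裸奔在内网">VIII. Gateway, Authentication, and Rate Limiting: Do Not Leave vLLM Exposed on the Intranet</h2>
<p>Even for an intranet-only service, do not let business systems hit the vLLM port directly. A more robust approach is to put a gateway in front, such as Nginx, Kong, Traefik, or an in-house API gateway. The gateway handles authentication, rate limiting, timeouts, request-body size limits, log redaction, and routing. vLLM should focus on inference rather than carry every platform responsibility.</p>
<p>An Nginx reverse-proxy example:</p>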
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-nginx" data-lang="nginx"><span class="line"><span class="cl"><span class="k">server</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kn">listen</span> <span class="mi">8080</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kn">client_max_body_size</span> <span class="mi">8m</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="kn">location</span> <span class="s">/v1/</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_pass</span> <span class="s">http://127.0.0.1:8000/v1/</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_http_version</span> <span class="mi">1</span><span class="s">.1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_set_header</span> <span class="s">Connection</span> <span class="s">&#34;&#34;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_set_header</span> <span class="s">Host</span> <span class="nv">$host</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_read_timeout</span> <span class="s">300s</span><span class="p">;</span>
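</span></span><span class="line"><span class="cl">        <span class="kn">proxy_send_timeout</span> <span class="s">300s</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_buffering</span> <span class="s">off</span><span class="p">;</span>  <span class="c1"># forward streamed (SSE) chunks as they arrive instead of buffering them</span>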
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
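</span></span></code></pre></div><p>Rate limiting can be layered by user, application, and model. For example, the developer assistant allows longer outputs while online customer service allows only short ones; batch summarization goes through an asynchronous queue while interactive chat uses the synchronous streaming API; low-priority tasks are queued or downgraded to a smaller model when the GPU is busy. The benefit is that “quality of service” becomes a configurable policy instead of all requests slowing each other down in one queue.</p>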
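<p>One common building block for such tiered policies is a token bucket per tier. The sketch below is a generic illustration rather than the configuration of any specific gateway; the tier names and rates are hypothetical, and the clock is injectable so the behavior can be checked deterministically.</p>

```python
import time


class TokenBucket:
    """Classic token bucket: refill at `rate` per second up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        """Consume `cost` tokens if available; `now` is injectable for testing."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical tiers: (requests per second, burst size).
TIERS = {"interactive_chat": (5.0, 10.0), "batch_summary": (0.5, 2.0)}
buckets = {name: TokenBucket(rate, cap, now=0.0) for name, (rate, cap) in TIERS.items()}

# The batch tier allows a burst of two, then refills one request every two seconds.
print([buckets["batch_summary"].allow(now=0.0) for _ in range(3)])
print(buckets["batch_summary"].allow(now=2.0))
```

<p>A rejected request does not have to fail outright: the caller can place it on an asynchronous queue or route it to a smaller model, which matches the degradation options described above.</p>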
<h2 id="九docker-compose-与-systemd两种常见部署方式">IX. Docker Compose and systemd: Two Common Deployment Options</h2>
<p>If the team prefers containers, manage vLLM with Docker Compose. Example configuration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">services</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">vllm</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">vllm/vllm-openai:latest</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">container_name</span><span class="p">:</span><span class="w"> </span><span class="l">vllm-qwen</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">restart</span><span class="p">:</span><span class="w"> </span><span class="l">unless-stopped</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">ipc</span><span class="p">:</span><span class="w"> </span><span class="l">host</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="s2">&#34;8000:8000&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">volumes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">/data/models:/models:ro</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">environment</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">NVIDIA_VISIBLE_DEVICES=all</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">command</span><span class="p">:</span><span class="w"> </span><span class="p">&gt;</span><span class="sd">
</span></span></span><span class="line"><span class="cl"><span class="sd">      --model /models/Qwen2.5-7B-Instruct
</span></span></span><span class="line"><span class="cl"><span class="sd">      --served-model-name qwen2.5-7b-instruct
</span></span></span><span class="line"><span class="cl"><span class="sd">      --host 0.0.0.0
</span></span></span><span class="line"><span class="cl"><span class="sd">      --port 8000
</span></span></span><span class="line"><span class="cl"><span class="sd">      --max-model-len 8192
</span></span></span><span class="line"><span class="cl"><span class="sd">      --gpu-memory-utilization 0.90
</span></span></span><span class="line"><span class="cl"><span class="sd">      --max-num-seqs 64
</span></span></span><span class="line"><span class="cl"><span class="sd">      --max-num-batched-tokens 8192</span><span class="w">      
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">deploy</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">reservations</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">devices</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span>- <span class="nt">driver</span><span class="p">:</span><span class="w"> </span><span class="l">nvidia</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">              </span><span class="nt">count</span><span class="p">:</span><span class="w"> </span><span class="l">all</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">              </span><span class="nt">capabilities</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">gpu]</span><span class="w">
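</span></span></span></code></pre></div><p>Containerization pins dependencies and makes the service easy to move between hosts. The cost is that the GPU driver, the NVIDIA Container Toolkit, shared memory, and the image version all have to be kept under control. For a single-host intranet service, systemd is a practical alternative:</p>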
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="k">[Unit]</span>
</span></span><span class="line"><span class="cl"><span class="na">Description</span><span class="o">=</span><span class="s">vLLM OpenAI Compatible Server</span>
</span></span><span class="line"><span class="cl"><span class="na">After</span><span class="o">=</span><span class="s">network-online.target</span>
</span></span><span class="line"><span class="cl"><span class="na">Wants</span><span class="o">=</span><span class="s">network-online.target</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">[Service]</span>
</span></span><span class="line"><span class="cl"><span class="na">Type</span><span class="o">=</span><span class="s">simple</span>
</span></span><span class="line"><span class="cl"><span class="na">User</span><span class="o">=</span><span class="s">vllm</span>
</span></span><span class="line"><span class="cl"><span class="na">Group</span><span class="o">=</span><span class="s">vllm</span>
</span></span><span class="line"><span class="cl"><span class="na">WorkingDirectory</span><span class="o">=</span><span class="s">/data</span>
</span></span><span class="line"><span class="cl"><span class="na">Environment</span><span class="o">=</span><span class="s">&#34;PATH=/opt/venvs/vllm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin&#34;</span>
</span></span><span class="line"><span class="cl"><span class="na">ExecStart</span><span class="o">=</span><span class="s">/opt/venvs/vllm/bin/vllm serve /data/models/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000 --served-model-name qwen2.5-7b-instruct --max-model-len 8192 --gpu-memory-utilization 0.90 --max-num-seqs 64 --max-num-batched-tokens 8192</span>
</span></span><span class="line"><span class="cl"><span class="na">Restart</span><span class="o">=</span><span class="s">always</span>
</span></span><span class="line"><span class="cl"><span class="na">RestartSec</span><span class="o">=</span><span class="s">5</span>
</span></span><span class="line"><span class="cl"><span class="na">LimitNOFILE</span><span class="o">=</span><span class="s">1048576</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">[Install]</span>
</span></span><span class="line"><span class="cl"><span class="na">WantedBy</span><span class="o">=</span><span class="s">multi-user.target</span>
</span></span></code></pre></div><p>After deploying the unit, manage it with the following commands:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">systemctl daemon-reload
</span></span><span class="line"><span class="cl">systemctl <span class="nb">enable</span> --now vllm
</span></span><span class="line"><span class="cl">systemctl status vllm
</span></span><span class="line"><span class="cl">journalctl -u vllm -f
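</span></span></code></pre></div><p>Whichever approach you choose, keep the launch command in a configuration file rather than relying on SSH history. The model path, port, parameters, and versions should all be auditable and easy to roll back.</p>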
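<h2 id="十监控指标看见问题才谈得上优化">10. Monitoring Metrics: You Cannot Optimize What You Cannot See</h2>
<p>vLLM can export metrics. In practice, scrape them with Prometheus and chart them in Grafana. The dashboard does not need to be elaborate at first; start with these groups:</p>
<ul>
<li>Request rate: requests per minute, successes, and failures;</li>
<li>Latency distribution: TTFT (time to first token), end-to-end latency, P50/P95/P99;</li>
<li>Token throughput: input tokens, output tokens, total tokens;</li>
<li>Queueing: number of waiting requests, time spent in the queue;</li>
<li>GPU state: utilization, memory usage, temperature, power draw;</li>
<li>Service state: process restarts, error log volume, HTTP 5xx responses.</li>
</ul>
<p>The point of monitoring is not to "look professional" but to answer concrete questions. Is the service slow because requests are queueing, or because individual requests are too long? Is high GPU memory usage normal caching, or a leak or fragmentation? Is GPU utilization low because batches are too small, or because there is simply little traffic? Did P95 rise because the model got slower, or because one user submitted a very long prompt?</p>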
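<p>To make the P50/P95/P99 distinction concrete, here is a minimal sketch that computes nearest-rank percentiles from a list of latency samples. The sample values are hypothetical; in a real setup Prometheus and Grafana would normally compute these from histograms.</p>

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(samples):
    """Return the three percentiles a latency panel usually shows."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}

# Hypothetical TTFT samples in seconds: nine ordinary requests and one very long prompt.
ttft_seconds = [0.21, 0.25, 0.19, 0.30, 0.22, 0.24, 2.80, 0.23, 0.26, 0.20]
summary = latency_summary(ttft_seconds)
print(summary)
```

<p>In this made-up data the single slow request leaves P50 untouched but dominates P95 and P99, which is why a rising P95 alone does not tell you whether the model got slower.</p>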
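<p>For every request, log the request ID, the user or application identifier, the model name, input tokens, output tokens, start time, end time, and error type. Do not casually write raw prompts that may contain sensitive data into logs; if you must keep samples, redact them and restrict access by tier.</p>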
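<p>A minimal sketch of such a log record follows. The field names and the <code>build_request_log</code> helper are illustrative assumptions, not a vLLM API. The prompt is stored only as a SHA-256 digest so that duplicates can be correlated without keeping the text. A digest is not redaction of a stored sample: short or guessable prompts can still be matched by brute force.</p>

```python
import hashlib
import json
import uuid

def build_request_log(app_id, model, prompt, input_tokens, output_tokens,
                      started_at, finished_at, error_type=None):
    """Return one JSON log line; the prompt text is replaced by its SHA-256 digest."""
    record = {
        "request_id": str(uuid.uuid4()),
        "app_id": app_id,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "started_at": started_at,
        "finished_at": finished_at,
        "duration_s": round(finished_at - started_at, 3),
        "error_type": error_type,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }
    return json.dumps(record, ensure_ascii=False)

# Hypothetical request: timestamps are plain epoch seconds.
line = build_request_log("kb-qa", "qwen2.5-7b-instruct", "secret customer text",
                         1200, 400, started_at=1000.0, finished_at=1001.85)
print(line)
```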
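<h2 id="十一常见故障与排查思路">11. Common Failures and How to Troubleshoot Them</h2>
<h3 id="1-cuda-oom">1. CUDA OOM</h3>
<p>When you hit an out-of-memory error, the first reaction should not be "buy a bigger GPU". Check four things first: whether the model is too large, whether the context length is set too high, whether <code>gpu-memory-utilization</code> is too aggressive, and whether the business layer allows too much concurrency or overly long outputs. As a short-term fix, lower <code>max-model-len</code>, lower <code>max-num-seqs</code>, cap the maximum output tokens, or switch to a quantized model. In the long run, redo capacity planning from the real token distribution.</p>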
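<p>A back-of-the-envelope KV cache estimate helps show why context length and concurrency matter so much. The sketch below uses the standard per-token formula (one key and one value vector per layer per KV head). The architecture numbers are assumptions for a 7B-class model with grouped-query attention and a 2-byte (fp16/bf16) cache; read the real values from your model's <code>config.json</code>. It ignores block rounding and other runtime overhead.</p>

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: a key and a value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_cache_gib(tokens, bytes_per_token):
    """Convert a token count into GiB of KV cache."""
    return tokens * bytes_per_token / 1024**3

# Assumed architecture values (not taken from the article); check config.json:
# num_hidden_layers, num_key_value_heads, hidden_size / num_attention_heads.
per_token = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)

# Worst case for the flags used above: 64 sequences, each filling all 8192 tokens.
worst_case = kv_cache_gib(64 * 8192, per_token)
print(f"{per_token} bytes/token, worst case {worst_case:.1f} GiB")
```

<p>Under these assumptions the worst case needs about 28 GiB for the cache alone, on top of the model weights. Real traffic rarely fills every sequence to the limit, which is why the real token distribution matters more than the configured maximums.</p>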
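<h3 id="2-首-token-很慢">2. Slow First Token</h3>
<p>A slow first token is usually tied to input length, queueing time, or prefill pressure. First determine whether requests are waiting in the server queue or whether the model computation itself is slow. If TTFT climbs noticeably as soon as concurrency rises, the scheduler is under pressure: tune <code>max-num-batched-tokens</code>, cap very long inputs, or split batch jobs and interactive traffic into two separate services.</p>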
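<p>One way to start is to measure TTFT and total time on the client at several concurrency levels. The sketch below times any iterable of streamed chunks; <code>fake_stream</code> is a stand-in used only so the example runs without a server. With a real client you would pass the streaming response iterator instead.</p>

```python
import time

def measure_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of streamed chunks; return (ttft_s, total_s, chunk_count)."""
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = clock()  # the moment the first chunk arrived
        count += 1
    end = clock()
    ttft = None if first is None else first - start
    return ttft, end - start, count

def fake_stream(prefill_s, per_chunk_s, n):
    """Stand-in for a streaming response: a slow first chunk, then steady decoding."""
    time.sleep(prefill_s)
    for i in range(n):
        if i:
            time.sleep(per_chunk_s)
        yield f"tok{i}"

ttft, total, count = measure_stream(fake_stream(0.05, 0.01, 5))
print(f"TTFT {ttft:.3f}s, total {total:.3f}s, chunks {count}")
```

<p>If TTFT grows with concurrency while the gap between later chunks stays roughly constant, queueing or prefill pressure is the more likely cause than slow decoding.</p>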
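<h3 id="3-gpu-利用率不高但请求仍然慢">3. Low GPU Utilization but Requests Are Still Slow</h3>
<p>Possible causes include many small, fragmented requests, a gateway or client that reads the response slowly, CPU-side preprocessing becoming the bottleneck, inefficient cross-GPU communication, or monitoring samples that miss short bursts of load. Do not fixate on the utilization figure in <code>nvidia-smi</code>; read it together with token throughput and the server-side queue.</p>
<h3 id="4-输出质量和离线测试不一致">4. Output Quality Differs from Offline Tests</h3>
<p>Check that the chat template, system prompt, temperature, top_p, max_tokens, stop words, and model version all match. Because the local service exposes an OpenAI-compatible interface, the business layer may wrap the messages in its own way; once the template differs, both the style and the quality of the output change.</p>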
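<p>A quick way to run that check is to dump the settings used offline and online and diff them. This is a minimal sketch with hypothetical values; the dictionaries stand for whatever your evaluation script and your business layer actually send.</p>

```python
def diff_generation_settings(offline, online):
    """Return {key: (offline_value, online_value)} for every setting that differs."""
    diffs = {}
    for key in sorted(set(offline) | set(online)):
        a = offline.get(key, "UNSET")
        b = online.get(key, "UNSET")
        if a != b:
            diffs[key] = (a, b)
    return diffs

# Hypothetical settings captured from an offline evaluation and from the live service.
offline = {"model": "qwen2.5-7b-instruct", "temperature": 0.0, "top_p": 1.0, "max_tokens": 512}
online = {"model": "qwen2.5-7b-instruct", "temperature": 0.7, "top_p": 1.0,
          "max_tokens": 512, "stop": ["###"]}

diffs = diff_generation_settings(offline, online)
for key, (a, b) in diffs.items():
    print(f"{key}: offline={a!r} online={b!r}")
```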
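<h3 id="5-服务偶发卡死或重启">5. Occasional Hangs or Restarts</h3>
<p>Start with <code>journalctl</code>, the container logs, dmesg, and GPU Xid errors. Driver problems, power supply problems, running too close to the GPU memory limit, and incompatible dependency versions can all cause intermittent failures. In production, pin the image and driver versions, and run a regression with the same load-test set before every upgrade.</p>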
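<h2 id="十二容量规划用-token-预算而不是拍脑袋">12. Capacity Planning: Use a Token Budget, Not Guesswork</h2>
<p>Capacity planning for an LLM service can start from a token budget. Suppose one business entry point averages 1,200 input tokens and 400 output tokens per request, with a peak of 300 requests per minute. That is about 480,000 tokens per minute, or 8,000 tokens per second. Combine this with the per-instance token throughput measured in load tests to estimate how many GPU instances you need. Real planning also has to account for P95, peaks and troughs, long-tail requests, retries, and model switches.</p>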
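<p>A rough formula, where headroom is the fraction of each instance's stable throughput that is deliberately left unused:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">peak tokens/s = peak requests/s × (avg input tokens + avg output tokens)
</span></span><span class="line"><span class="cl">required instances = ceil(peak tokens/s ÷ (stable tokens/s per instance × (1 - headroom)))
</span></span></code></pre></div><p>Keep the headroom at 30% to 50% or more. An inference service is not an offline batch job and cannot run at its throughput limit for long; otherwise a long prompt or a burst of retries makes the queue grow rapidly.</p>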
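<p>As a worked example of this estimate, the sketch below plugs in the numbers from this section. The per-instance throughput of 3,000 tokens per second is a hypothetical placeholder for your own load-test result, and <code>headroom</code> here means the fraction of that throughput left unused. Treating input and output tokens as one combined rate is a simplification, since prefill and decode cost differently.</p>

```python
import math

def required_instances(peak_requests_per_min, avg_input_tokens, avg_output_tokens,
                       stable_tokens_per_s, headroom):
    """Return (peak tokens/s, instances) so each instance runs at most (1 - headroom) of stable throughput."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    peak_tokens_per_s = peak_requests_per_min / 60 * (avg_input_tokens + avg_output_tokens)
    usable_per_instance = stable_tokens_per_s * (1 - headroom)
    return peak_tokens_per_s, math.ceil(peak_tokens_per_s / usable_per_instance)

# 300 requests/min, 1200 input + 400 output tokens, as in the text above.
# stable_tokens_per_s=3000 is an assumed load-test result, not a measured figure.
peak, n_30 = required_instances(300, 1200, 400, stable_tokens_per_s=3000, headroom=0.3)
_, n_50 = required_instances(300, 1200, 400, stable_tokens_per_s=3000, headroom=0.5)
print(f"peak {peak:.0f} tokens/s -> {n_30} instances at 30% headroom, {n_50} at 50%")
```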
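<h2 id="十三一个可落地的上线清单">13. A Practical Launch Checklist</h2>
<p>Before going live, confirm each item below:</p>
<ol>
<li>The model path is fixed and the model version is traceable;</li>
<li>The vLLM, CUDA, driver, and Python versions are recorded;</li>
<li>Launch parameters live in systemd or Compose, not in hand-typed commands;</li>
<li><code>/v1/chat/completions</code> passes both streaming and non-streaming tests;</li>
<li>Load tests cover short, medium, and long prompts;</li>
<li>Maximum input length, maximum output tokens, and request timeout are set;</li>
<li>The gateway enforces authentication, rate limiting, and a request body size limit;</li>
<li>Monitoring covers latency, throughput, queueing, error rate, and GPU state;</li>
<li>Logs can be searched by request ID without leaking sensitive data;</li>
<li>A rollback plan exists for switching back to the previous model or parameters quickly.</li>
</ol>
<p>If all ten items are in place, the service is already far more reliable than "just open a port for everyone", even at a small scale.</p>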
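<h2 id="十四进阶方向多模型lora-与路由策略">14. Going Further: Multiple Models, LoRA, and Routing</h2>
<p>Once a vLLM service is stable, the next challenge is usually multiple models. Different workloads need different capabilities: customer service needs low latency, a code assistant needs long context, knowledge-base question answering needs to follow output formats reliably, and batch summarization favors throughput. Sending every request to the single largest model is both expensive and slow. A better approach is model routing: simple questions go to a small model and complex ones to a large model; interactive requests go to low-latency instances and batch requests to throughput-oriented instances; high-priority users get dedicated quotas, while low-priority jobs can wait in a queue.</p>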
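<p>A routing rule of this kind can start very small. The sketch below is a hypothetical illustration: the backend names, the <code>Request</code> fields, and the 4,000-token threshold are all assumptions to be replaced with your own served model names and measured limits.</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    prompt_tokens: int   # estimated input length
    interactive: bool    # True for user-facing traffic, False for batch jobs

# Hypothetical backend names; map them to your own --served-model-name values.
SMALL_FAST = "small-interactive"
LARGE_LONG = "large-long-context"
BATCH = "batch-throughput"

def route(req, long_prompt_threshold=4000):
    """Pick a backend from the request shape: batch first, then long prompts, else the small model."""
    if not req.interactive:
        return BATCH
    if req.prompt_tokens > long_prompt_threshold:
        return LARGE_LONG
    return SMALL_FAST

for req in (Request(300, True), Request(6000, True), Request(6000, False)):
    print(req, "->", route(req))
```

<p>Keeping the rule as a pure function makes it easy to unit test and to replay against logged traffic before changing thresholds.</p>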
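<p>LoRA is another common requirement. It lets one base model load different adapter weights per workload, which avoids the GPU memory wasted by keeping several full copies of the model. However, LoRA management, hot loading, quality evaluation, and isolation all need extra design. Do not mix several business LoRA adapters into one production instance without evaluation and rollback mechanisms.</p>
<p>Further on, you can build a unified LLM Gateway: upstream it exposes a single OpenAI-compatible interface, and downstream it manages vLLM, cloud APIs, embedding, rerank, small models, and caches. Applications care only about model capability and SLA, while the platform handles routing, rate limiting, auditing, cost, and observability. At that point vLLM is no longer a single command but one part of the inference infrastructure.</p>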
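<h2 id="总结">Summary</h2>
<p>The value of vLLM is not just "turning a local model into an API". What it really addresses are several hard problems in online inference: how to schedule requests of different lengths continuously, how to manage the KV cache efficiently, how an OpenAI-compatible interface lowers integration cost, and how server-side parameters balance throughput, latency, and GPU memory.</p>
<p>When rolling it out, avoid two extremes. One is looking only at the demo and assuming that a service that can reply is ready for production. The other is chasing a complex platform too early and never shipping. A more pragmatic path is: first pick the model and hardware and start the vLLM OpenAI-compatible service; then load-test with real prompts and record TTFT, throughput, GPU memory, and error rate; then tune GPU memory utilization, concurrent sequences, batched tokens, and context length one at a time; finally add the gateway, rate limiting, monitoring, logging, and rollback.</p>
<p>For most teams, a stable local inference service gradually becomes the foundation of their AI applications. It does not have to replace every cloud capability, but it can take over the high-frequency, sensitive, and controllable requests, giving the business more control over cost, performance, and security. vLLM is a tool worth trying first when building that foundation.</p>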
]]></content:encoded>
    </item>
  </channel>
</rss>
