feat(llama-cpp): make server-side prompt cache work by default #9925

Merged: mudler merged 1 commit into master from fix/llama-cpp-prompt-cache-defaults on May 21, 2026
Conversation

@localai-bot
Collaborator

Summary

Fixes #9921. Repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) now hit the server-side prompt cache out of the box: no YAML changes required.

LocalAI's llama-cpp gRPC backend hardcodes `n_parallel=1` and was also hardcoding `kv_unified=false`, which silently force-disables `cache_idle_slots` at upstream's server init (server-context.cpp:1023-1034). The host prompt cache was allocated (`cache_ram_mib=-1`, meaning no limit) but never written across requests, so a 20-25k-token system prompt was re-prefilled on every call. Upstream sidesteps this by bumping `n_parallel` to 4 and flipping `kv_unified=true` when the user leaves the slot count on auto (tools/server/server.cpp:100-105); LocalAI never hit that path.
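
The interaction described above can be modeled in a few lines. This is a simplified sketch, not upstream's code: `SketchParams`, `apply_auto_slots`, and `resolve_idle_slot_cache` are hypothetical names, the fields only mirror the parameters named in this PR, and using `-1` as the "auto" slot count is an assumption.

```cpp
// Simplified model of the behavior described in this PR; not the upstream source.
struct SketchParams {
    int  n_parallel       = -1;    // assumption: -1 stands for "auto" slot count
    bool kv_unified       = false;
    bool cache_idle_slots = true;
    int  cache_ram_mib    = -1;    // -1 = no limit, 0 = prompt cache disabled
};

// Upstream's auto path as described above: auto slot count => 4 slots, unified KV.
inline void apply_auto_slots(SketchParams &p) {
    if (p.n_parallel < 0) {
        p.n_parallel = 4;
        p.kv_unified = true;
    }
}

// Server-init rule as described above: idle-slot saving is dropped when the KV
// cache is not unified. Assumption: with the host prompt cache disabled
// (cache_ram_mib == 0) there is nowhere to save idle slots either.
inline bool resolve_idle_slot_cache(const SketchParams &p) {
    return p.cache_idle_slots && p.kv_unified && p.cache_ram_mib != 0;
}
```

Under this model, the old LocalAI configuration (`n_parallel=1`, `kv_unified=false`) skips the auto path and resolves to no idle-slot saving, while the same single-slot setup with `kv_unified=true` resolves to saving enabled.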

This PR:

  • Flips `kv_unified` default to `true` in `backend/cpp/llama-cpp/grpc-server.cpp` (keeping `n_parallel=1` to avoid a slot-count behavior change). Single-slot setups now get prompt-cache hits across requests.
  • Bumps `n_ctx_checkpoints` default from 8 to 32 to match upstream.
  • Initializes `cache_idle_slots=true` and `checkpoint_every_nt=8192` explicitly (both match upstream defaults).
  • Exposes `cache_idle_slots` / `idle_slots_cache` and `checkpoint_every_nt` / `checkpoint_every_n_tokens` as new option keys so users can opt out or tune.
  • Fixes docs: the `cache_ram` description was wrong (it's the host-side prompt cache, not the KV cache). Documents the `kv_unified` + `cache_ram` + `cache_idle_slots` interaction, adds rows for the two newly-exposed options, and adds a worked example for the repeated-system-prompt workload from the issue.
  • Marks the legacy `prompt_cache_path` / `prompt_cache_all` / `prompt_cache_ro` YAML fields as unused by the llama-cpp gRPC backend in `docs/content/advanced/model-configuration.md` (they target upstream's CLI completion tool and are not read by `grpc-server.cpp`) and points readers at the new prompt-cache explainer.
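
To illustrate how the new option keys and their aliases could map onto the defaults listed above, here is a minimal sketch. It is not the actual parser in `grpc-server.cpp`: `SketchDefaults`, `set_bool`, `set_int`, and `apply_options` are hypothetical names, and the accepted boolean spellings (`true`/`1`, `false`/`0`) are an assumption.

```cpp
#include <cstddef>
#include <exception>
#include <string>
#include <vector>

// Hypothetical holder for the defaults this PR describes.
struct SketchDefaults {
    bool kv_unified          = true;   // was false before this PR
    int  n_ctx_checkpoints   = 32;     // was 8 before this PR
    bool cache_idle_slots    = true;
    int  checkpoint_every_nt = 8192;
};

// Assumption: only these spellings are recognized; anything else keeps the default.
inline void set_bool(bool &target, const std::string &val) {
    if (val == "true" || val == "1") target = true;
    else if (val == "false" || val == "0") target = false;
}

// Accepts a value only if the whole string parses as an int.
inline void set_int(int &target, const std::string &val) {
    try {
        std::size_t used = 0;
        const int parsed = std::stoi(val, &used);
        if (used == val.size()) target = parsed;
    } catch (const std::exception &) {
        // keep the default on malformed or out-of-range input
    }
}

// Applies "key:value" option strings; unknown keys are ignored.
inline SketchDefaults apply_options(const std::vector<std::string> &options) {
    SketchDefaults p;
    for (const std::string &opt : options) {
        const std::size_t colon = opt.find(':');
        if (colon == std::string::npos) continue;
        const std::string key = opt.substr(0, colon);
        const std::string val = opt.substr(colon + 1);

        if (key == "kv_unified") {
            set_bool(p.kv_unified, val);
        } else if (key == "cache_idle_slots" || key == "idle_slots_cache") {
            set_bool(p.cache_idle_slots, val);
        } else if (key == "checkpoint_every_nt" || key == "checkpoint_every_n_tokens") {
            set_int(p.checkpoint_every_nt, val);
        }
    }
    return p;
}
```

With no options the sketch returns the new defaults; passing `idle_slots_cache:false` turns off idle-slot saving while leaving `kv_unified` at `true`.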

Test plan

  • `make -C backend/cpp/llama-cpp clean && make backends/llama-cpp` builds clean (done locally)
  • Smoke test: load any GGUF, send the same 5k+ token system prompt 3 times via `/v1/chat/completions`. Call 1 prefills; calls 2-3 should hit the warm prompt cache and finish in seconds. Check server log for `prompt cache is enabled` and `idle slots will be saved to prompt cache and cleared upon starting a new task`.
  • Verify `options: ["kv_unified:false"]` restores the old behavior (no idle-slot saving, prefill repeated).
  • Verify `options: ["cache_ram:0"]` disables the prompt cache entirely.
  • Verify `options: ["cache_idle_slots:false"]` keeps `kv_unified=true` but disables idle-slot saving.
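
The log check in the smoke test can be automated with a small helper. This sketch assumes the server log has already been captured into a string; `log_confirms_prompt_cache` is a hypothetical name, and only the two quoted messages come from the test plan above.

```cpp
#include <string>

// Log messages quoted in the test plan above.
const std::string kCacheEnabledMsg = "prompt cache is enabled";
const std::string kIdleSlotsMsg =
    "idle slots will be saved to prompt cache and cleared upon starting a new task";

// Returns true only when both messages appear somewhere in the captured log.
inline bool log_confirms_prompt_cache(const std::string &log_text) {
    return log_text.find(kCacheEnabledMsg) != std::string::npos &&
           log_text.find(kIdleSlotsMsg) != std::string::npos;
}
```

A log that contains only the first message would fail the check, which matches the `cache_idle_slots:false` case where the prompt cache stays enabled but idle-slot saving is off.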

Reported and analyzed by @pos-ei-don on a DGX Spark (GB10/ARM64, CUDA 13): 5-8 min per turn for Claude-Code-style sessions without the cache, collapsing to seconds with it on.

🤖 Generated with Claude Code

Aligns LocalAI's llama-cpp gRPC backend with upstream's auto-on prompt
cache path so repeated system prompts (agents, OpenAI/Anthropic-compatible
CLIs, coding assistants) skip prefill on subsequent calls without any
YAML changes. Reported in #9921.

Upstream's server enables `kv_unified=true` (and bumps `n_parallel` to 4)
when slot count is auto, which unlocks `cache_idle_slots`. LocalAI
hardcodes `n_parallel=1` and so far also hardcoded `kv_unified=false`,
which silently force-disables idle-slot saving at server init. The host
prompt cache was allocated but never written across requests.

Changes in backend/cpp/llama-cpp/grpc-server.cpp:
- params.kv_unified: false -> true (single-slot path now benefits from
  the prompt cache; users can opt out with `kv_unified:false`)
- params.n_ctx_checkpoints: 8 -> 32 (match upstream default)
- params.cache_idle_slots = true initialized explicitly (upstream default)
- params.checkpoint_every_nt = 8192 initialized explicitly (upstream default)
- New option parsers: cache_idle_slots / idle_slots_cache,
  checkpoint_every_nt / checkpoint_every_n_tokens

Docs:
- features/text-generation.md: fix misleading `cache_ram` description
  (it's the host-side prompt cache, not the KV cache), document the
  kv_unified + cache_ram + cache_idle_slots interaction, add rows for
  the two newly-exposed options, and add a worked example for the
  agent/CLI workload from the issue.
- advanced/model-configuration.md: mark the legacy `prompt_cache_path`
  / `prompt_cache_all` / `prompt_cache_ro` YAML fields as unused by the
  llama-cpp gRPC backend (they target upstream's CLI completion tool
  and are not consumed by grpc-server.cpp) and point readers at the
  new prompt-cache explainer.

Closes #9921

Assisted-by: claude:opus-4.7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler merged commit 959de86 into master on May 21, 2026
63 of 64 checks passed
@mudler deleted the fix/llama-cpp-prompt-cache-defaults branch on May 21, 2026 14:31
@localai-bot added the enhancement (New feature or request) label on May 24, 2026

Linked issue: Docs: llama-cpp options 'cache_ram' + 'kv_unified' not documented (huge latency win for re-prompted system prompts)
