feat(llama-cpp): make server-side prompt cache work by default by localai-bot · Pull Request #9925 · mudler/LocalAI

localai-bot · 2026-05-21T14:00:58Z

Summary

Fixes #9921. Repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) now hit the server-side prompt cache out of the box: no YAML changes required.

LocalAI's llama-cpp gRPC backend hardcodes `n_parallel=1` and was also hardcoding `kv_unified=false`, which silently force-disables `cache_idle_slots` at upstream's server init (server-context.cpp:1023-1034). The host prompt cache was being allocated (`cache_ram_mib=-1` = no limit) but never written across requests, so a 20-25k-token system prompt got re-prefilled on every call. Upstream sidesteps this by bumping `n_parallel` to 4 and flipping `kv_unified=true` when the user leaves slot count on auto (tools/server/server.cpp:100-105); LocalAI never hit that path.

This PR:

Flips `kv_unified` default to `true` in `backend/cpp/llama-cpp/grpc-server.cpp` (keeping `n_parallel=1` to avoid a slot-count behavior change). Single-slot setups now get prompt-cache hits across requests.
Bumps `n_ctx_checkpoints` default from 8 to 32 to match upstream.
Initializes `cache_idle_slots=true` and `checkpoint_every_nt=8192` explicitly (both match upstream defaults).
Exposes `cache_idle_slots` / `idle_slots_cache` and `checkpoint_every_nt` / `checkpoint_every_n_tokens` as new option keys so users can opt out or tune.
Fixes docs: the `cache_ram` description was wrong (it's the host-side prompt cache, not the KV cache). Documents the kv_unified + cache_ram + cache_idle_slots interaction, adds rows for the two newly-exposed options, and adds an explainer worked example for the repeated-system-prompt workload from the issue.
Marks the legacy `prompt_cache_path` / `prompt_cache_all` / `prompt_cache_ro` YAML fields as unused by the llama-cpp gRPC backend in `docs/content/advanced/model-configuration.md` (they target upstream's CLI completion tool and are not read by `grpc-server.cpp`) and point readers at the new prompt-cache explainer.

Test plan

`make -C backend/cpp/llama-cpp clean && make backends/llama-cpp` builds clean (done locally)
Smoke test: load any GGUF, send the same 5k+ token system prompt 3 times via `/v1/chat/completions`. Call 1 prefills; calls 2-3 should hit the warm prompt cache and finish in seconds. Check server log for `prompt cache is enabled` and `idle slots will be saved to prompt cache and cleared upon starting a new task`.
Verify `options: ["kv_unified:false"]` restores the old behavior (no idle-slot saving, prefill repeated).
Verify `options: ["cache_ram:0"]` disables the prompt cache entirely.
Verify `options: ["cache_idle_slots:false"]` keeps `kv_unified=true` but disables idle-slot saving.

Reported and analyzed by @pos-ei-don on a DGX Spark (GB10/ARM64, CUDA 13): 5-8 min per turn for Claude-Code-style sessions without the cache, collapsing to seconds with it on.

🤖 Generated with Claude Code

Aligns LocalAI's llama-cpp gRPC backend with upstream's auto-on prompt cache path so repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) skip prefill on subsequent calls without any YAML changes. Reported in #9921. Upstream's server enables `kv_unified=true` (and bumps `n_parallel` to 4) when slot count is auto, which unlocks `cache_idle_slots`. LocalAI hardcodes `n_parallel=1` and so far also hardcoded `kv_unified=false`, which silently force-disables idle-slot saving at server init. The host prompt cache was allocated but never written across requests. Changes in backend/cpp/llama-cpp/grpc-server.cpp: - params.kv_unified: false -> true (single-slot path now benefits from the prompt cache; users can opt out with `kv_unified:false`) - params.n_ctx_checkpoints: 8 -> 32 (match upstream default) - params.cache_idle_slots = true initialized explicitly (upstream default) - params.checkpoint_every_nt = 8192 initialized explicitly (upstream default) - New option parsers: cache_idle_slots / idle_slots_cache, checkpoint_every_nt / checkpoint_every_n_tokens Docs: - features/text-generation.md: fix misleading `cache_ram` description (it's the host-side prompt cache, not the KV cache), document the kv_unified + cache_ram + cache_idle_slots interaction, add rows for the two newly-exposed options, and add a worked example for the agent/CLI workload from the issue. - advanced/model-configuration.md: mark the legacy `prompt_cache_path` / `prompt_cache_all` / `prompt_cache_ro` YAML fields as unused by the llama-cpp gRPC backend (they target upstream's CLI completion tool and are not consumed by grpc-server.cpp) and point readers at the new prompt-cache explainer. Closes #9921 Assisted-by: claude:opus-4.7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

mudler merged commit 959de86 into master May 21, 2026
63 of 64 checks passed

mudler deleted the fix/llama-cpp-prompt-cache-defaults branch May 21, 2026 14:31

localai-bot added the enhancement New feature or request label May 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(llama-cpp): make server-side prompt cache work by default#9925

feat(llama-cpp): make server-side prompt cache work by default#9925
mudler merged 1 commit into
masterfrom
fix/llama-cpp-prompt-cache-defaults

localai-bot commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

localai-bot commented May 21, 2026

Summary

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants