citeformer.backends.llamacpp

llama.cpp backend via llama-cpp-python.

llama.cpp natively supports GBNF, which is exactly the format our grammar builder emits, so the integration is minimal: hand the grammar string to LlamaGrammar.from_string and pass the result to Llama.__call__ as its grammar argument.
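The hand-off described above can be sketched as follows. This is a minimal illustration, not citeformer's actual builder: citation_gbnf and generate_constrained are hypothetical names, and the toy grammar only restricts bracketed markers to [1]..[n]. The llama_cpp import is deferred so the grammar helper can be tried without the extra installed.

```python
def citation_gbnf(n_sources: int) -> str:
    """Build a toy GBNF grammar: free text interleaved with [1]..[n] markers.

    Hypothetical helper for illustration; the real builder's output may differ.
    """
    if n_sources < 1:
        raise ValueError("need at least one source")
    # One quoted alternative per allowed citation number.
    ids = " | ".join(f'"{i}"' for i in range(1, n_sources + 1))
    return "\n".join(
        [
            "root ::= (text | cite)*",
            # Any run of characters that are not square brackets.
            "text ::= [^\\[\\]]+",
            f'cite ::= "[" ({ids}) "]"',
        ]
    )


def generate_constrained(model_path: str, prompt: str, n_sources: int) -> str:
    """Run one grammar-constrained completion (requires llama-cpp-python)."""
    # Imported lazily so citation_gbnf works without the llamacpp extra.
    from llama_cpp import Llama, LlamaGrammar

    grammar = LlamaGrammar.from_string(citation_gbnf(n_sources))
    llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=-1, verbose=False)
    result = llm(prompt, grammar=grammar, max_tokens=256, temperature=0.7)
    return result["choices"][0]["text"]
```

Calling print(citation_gbnf(3)) shows the grammar text that would be handed to LlamaGrammar.from_string.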

Requires the llamacpp extra: pip install citeformer[llamacpp]. You also need a GGUF model file on disk: llama-cpp-python consumes GGUF, not HuggingFace weights directly. Download a prebuilt GGUF with huggingface_hub, or convert HuggingFace weights with the convert-hf-to-gguf script in the llama.cpp repo.
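Before loading, it can help to confirm that the file on disk really is GGUF rather than raw HuggingFace weights. The sketch below relies on the fact that GGUF files begin with the four ASCII bytes "GGUF"; looks_like_gguf is a hypothetical helper, not part of citeformer, and it checks only the magic bytes, not the rest of the file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def looks_like_gguf(path: str) -> bool:
    """Return True if `path` is a regular file starting with the GGUF magic."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(4) == GGUF_MAGIC
```

A False result for a .safetensors or .bin checkpoint is a hint that the conversion step is still needed.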

Module Contents

Classes

LlamaCppBackend

llama.cpp backend with grammar-level citation enforcement.

API

class citeformer.backends.llamacpp.LlamaCppBackend(model_path: str, *, n_ctx: int = 2048, n_gpu_layers: int = -1, verbose: bool = False)

Bases: citeformer.backends.base.Backend

llama.cpp backend with grammar-level citation enforcement.

Wraps llama_cpp.Llama. Citation markers are constrained at decode time by LlamaGrammar.from_string(grammar.gbnf). This is the same GBNF string our builder emits for XGrammar, so no translation layer is needed.

Attributes:
    model_path: Local filesystem path to a GGUF model.
    n_ctx: Context window size.
    n_gpu_layers: How many layers to offload to GPU (Metal/CUDA). -1 for all, 0 for CPU-only.
    llm: The loaded llama_cpp.Llama instance.

Initialization

Load a GGUF model via llama-cpp-python.

Args:
    model_path: Filesystem path to a GGUF file.
    n_ctx: Context window size (tokens). Larger = more memory.
    n_gpu_layers: Layers to offload to GPU. -1 offloads all (fastest on Metal / CUDA); 0 runs on CPU only.
    verbose: Whether to print llama.cpp’s decoding diagnostics.

Raises:
    ImportError: If the citeformer[llamacpp] extra isn’t installed.
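An ImportError of this kind is commonly produced by deferring the optional import and re-raising with an install hint. The following is a sketch of that general pattern, not the backend's actual code; require_extra is a hypothetical helper name.

```python
import importlib
from types import ModuleType


def require_extra(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the root cause stays visible.
        raise ImportError(
            f"{module_name!r} is not installed; "
            f"run: pip install citeformer[{extra}]"
        ) from exc
```

With this pattern, require_extra("llama_cpp", "llamacpp") would either return the module or fail with a message that names the missing extra.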

model_path: str


n_ctx: int


n_gpu_layers: int


llm: Any


generate(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) -> str

Generate text with llama.cpp’s GBNF-constrained decoder.

Args:
    prompt: User prompt. The caller assembles any RAG context.
    sources: Sources in scope (must be non-empty).
    policy: Citation enforcement policy.
    **options: Sampling and grammar overrides. Unknown keys are ignored.
        max_new_tokens: default 256.
        temperature: default 0.7.
        max_content_chars: REQUIRED-policy progression bound; see ADR-009.

Returns: Generated text with only valid [N] markers.
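Since the grammar is what guarantees valid markers, a post-hoc check is mainly useful in tests. The sketch below assumes markers are numbered from 1 up to the number of sources, which this page does not state explicitly; invalid_markers is a hypothetical helper.

```python
import re

MARKER_RE = re.compile(r"\[(\d+)\]")


def invalid_markers(text: str, n_sources: int) -> list[int]:
    """Return cited numbers outside 1..n_sources, in order of appearance.

    Assumes 1-based [N] markers; adjust the range if the numbering differs.
    """
    cited = (int(m) for m in MARKER_RE.findall(text))
    return [n for n in cited if not 1 <= n <= n_sources]
```

An empty list means every [N] in the text refers to a source that was in scope.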

stream(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) -> collections.abc.Iterator[str]

Stream text chunks from llama.cpp’s native streaming mode.

llama-cpp-python’s Llama.__call__(..., stream=True) yields dicts shaped like the non-streaming result: each has ["choices"][0]["text"] with the newly-decoded piece. Grammar enforcement is identical to generate().
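The chunk shape described above can be unpacked as in the sketch below. iter_text is a hypothetical helper, and fake_chunks is hand-written stand-in data (real chunks carry more keys) so the example runs without a model.

```python
from collections.abc import Iterable, Iterator
from typing import Any


def iter_text(chunks: Iterable[dict[str, Any]]) -> Iterator[str]:
    """Yield the newly decoded text piece from each streaming chunk."""
    for chunk in chunks:
        piece = chunk["choices"][0]["text"]
        if piece:  # skip empty pieces
            yield piece


# Stand-in for what Llama.__call__(..., stream=True) would yield.
fake_chunks = [
    {"choices": [{"text": "Paris is the capital"}]},
    {"choices": [{"text": " of France [1]."}]},
    {"choices": [{"text": ""}]},
]

full_text = "".join(iter_text(fake_chunks))
```

Joining the pieces reconstructs the same text that a non-streaming call would return in one piece.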