citeformer.backends.llamacpp
llama.cpp backend via llama-cpp-python.
llama.cpp has native GBNF support, which is exactly what our grammar builder
emits — so the integration is a one-liner: hand the grammar string to
LlamaGrammar.from_string and pass the result to Llama.__call__.
Requires the llamacpp extra: pip install citeformer[llamacpp]. You
also need a GGUF model file on disk. llama-cpp-python consumes GGUF, not
HuggingFace weights directly: download a prebuilt GGUF file with huggingface_hub, or convert existing weights with the convert-hf-to-gguf script in the llama.cpp repo.
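The integration described above can be sketched in a few lines. The grammar here is a toy stand-in written for illustration, not the string citeformer's builder emits; the helper names toy_citation_gbnf and generate_constrained are hypothetical, and the constructor values simply mirror the defaults documented for LlamaCppBackend. The llama_cpp import is deferred so the grammar helper runs even without the extra installed.

```python
def toy_citation_gbnf(n_sources: int) -> str:
    """Build a toy GBNF grammar that only allows [1]..[n_sources] as markers.

    Illustrative only: NOT the grammar citeformer's builder emits.
    """
    if n_sources < 1:
        raise ValueError("need at least one source")
    ids = " | ".join(f'"{i}"' for i in range(1, n_sources + 1))
    lines = [
        "root ::= (char | cite)*",
        f'cite ::= "[" ({ids}) "]"',
        # Any character except square brackets, so stray markers cannot appear.
        r"char ::= [^\[\]]",
    ]
    return "\n".join(lines) + "\n"


def generate_constrained(model_path: str, prompt: str, gbnf: str,
                         max_tokens: int = 256) -> str:
    """Run one grammar-constrained completion with llama-cpp-python."""
    # Deferred import: needs `pip install citeformer[llamacpp]`.
    from llama_cpp import Llama, LlamaGrammar

    llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=-1, verbose=False)
    grammar = LlamaGrammar.from_string(gbnf)  # compile the GBNF string
    result = llm(prompt, max_tokens=max_tokens, temperature=0.7, grammar=grammar)
    return result["choices"][0]["text"]
```

Given a GGUF file on disk, generate_constrained(path, prompt, toy_citation_gbnf(3)) would return text in which the only bracketed markers are [1], [2], and [3].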
Module Contents
Classes
LlamaCppBackend | llama.cpp backend with grammar-level citation enforcement.
API
- class citeformer.backends.llamacpp.LlamaCppBackend(model_path: str, *, n_ctx: int = 2048, n_gpu_layers: int = -1, verbose: bool = False)
Bases: citeformer.backends.base.Backend
llama.cpp backend with grammar-level citation enforcement.
Wraps llama_cpp.Llama. Citation markers are constrained at decode time by LlamaGrammar.from_string(grammar.gbnf). This is the same GBNF string our builder emits for XGrammar, so no translation layer is needed.
Attributes:
    model_path: Local filesystem path to a GGUF model.
    n_ctx: Context window size.
    n_gpu_layers: How many layers to offload to GPU (Metal/CUDA). -1 for all, 0 for CPU-only.
    llm: The loaded llama_cpp.Llama instance.
Initialization
Load a GGUF model via llama-cpp-python.
Args:
    model_path: Filesystem path to a GGUF file.
    n_ctx: Context window size (tokens). Larger = more memory.
    n_gpu_layers: Layers to offload to GPU. -1 offloads all (fastest on Metal / CUDA); 0 runs on CPU only.
    verbose: Whether to print llama.cpp’s decoding diagnostics.
Raises:
    ImportError: If citeformer[llamacpp] extras aren’t installed.
- generate(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) -> str
Generate text with llama.cpp’s GBNF-constrained decoder.
Args:
    prompt: User prompt. Caller assembles any RAG context.
    sources: Sources in scope (must be non-empty).
    policy: Citation enforcement policy.
    **options: Sampling and grammar overrides: max_new_tokens (default 256), temperature (default 0.7), max_content_chars (REQUIRED-policy progression bound; see ADR-009). Unknown keys are ignored.
Returns:
    Generated text with only valid [N] markers.
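The option handling described above can be pictured with a small sketch. This is an assumption about how such resolution might look, not citeformer's actual code: resolve_options is a hypothetical helper, and only the option names and defaults come from the docstring. The rename from max_new_tokens to max_tokens reflects the keyword that Llama.__call__ accepts.

```python
from typing import Any

# Defaults as stated in the docstring above.
_SAMPLING_DEFAULTS = {"max_new_tokens": 256, "temperature": 0.7}


def resolve_options(options: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split caller options into (llama_kwargs, grammar_overrides).

    Hypothetical helper: unknown keys are silently dropped, matching the
    documented "unknown keys ignored" behaviour.
    """
    known = {k: v for k, v in options.items() if k in _SAMPLING_DEFAULTS}
    merged = {**_SAMPLING_DEFAULTS, **known}
    llama_kwargs = {
        "max_tokens": merged["max_new_tokens"],  # llama-cpp-python's name
        "temperature": merged["temperature"],
    }
    grammar_overrides: dict[str, Any] = {}
    # No default is assumed for max_content_chars; it is forwarded only if given.
    if "max_content_chars" in options:
        grammar_overrides["max_content_chars"] = options["max_content_chars"]
    return llama_kwargs, grammar_overrides
```

From the caller's side nothing changes: options are passed as keywords, for example backend.generate(prompt, sources, policy, max_new_tokens=64, temperature=0.2).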
- stream(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) -> collections.abc.Iterator[str]
Stream text chunks from llama.cpp’s native streaming mode.
llama-cpp-python’s Llama.__call__(..., stream=True) yields dicts shaped like the non-streaming result: each has ["choices"][0]["text"] with the newly-decoded piece. Grammar enforcement is identical to generate().
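The chunk shape described above can be sketched as a small generator. The names iter_text and stream_constrained are hypothetical helpers, not part of citeformer; only the ["choices"][0]["text"] access and the stream=True keyword come from the description. Skipping empty pieces is an assumption about what a caller would want, not documented behaviour.

```python
from collections.abc import Iterable, Iterator
from typing import Any


def iter_text(chunks: Iterable[dict[str, Any]]) -> Iterator[str]:
    """Yield the newly decoded text piece from each streaming chunk."""
    for chunk in chunks:
        piece = chunk["choices"][0]["text"]
        if piece:  # assumption: callers do not want empty pieces
            yield piece


def stream_constrained(llm: Any, prompt: str, grammar: Any,
                       max_tokens: int = 256) -> Iterator[str]:
    """Stream a grammar-constrained completion.

    `llm` is expected to be a llama_cpp.Llama and `grammar` a
    llama_cpp.LlamaGrammar; both are typed loosely so this sketch
    imports nothing from llama_cpp.
    """
    chunks = llm(prompt, max_tokens=max_tokens, grammar=grammar, stream=True)
    yield from iter_text(chunks)
```

A caller would typically print pieces as they arrive, for example: for piece in backend.stream(prompt, sources, policy): print(piece, end="", flush=True).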