citeformer.backends.vllm
vLLM backend with grammar-level citation enforcement.
vLLM supports multiple guided-decoding backends (xgrammar, outlines,
lm-format-enforcer, llguidance). We pick XGrammar by default because
(a) it’s vLLM’s default in 2026, and (b) it’s what our HF backend already
uses, so a user running the same grammar through both backends gets
identical decode-time semantics.
Requires the vllm extra: pip install citeformer[vllm]. Linux with
CUDA only. vLLM doesn’t ship macOS or Windows wheels as of April 2026, so
this backend is excluded from the all extra and from the integration
tests that run on non-Linux hosts.
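Because the extra is Linux/CUDA only, code that runs on mixed hosts may want to probe before importing this module. A minimal sketch using only the standard library follows. The helper name vllm_backend_usable is hypothetical and not part of citeformer. It checks only the OS and whether a vllm package is installed, not whether a CUDA device is actually present.

```python
import importlib.util
import sys


def vllm_backend_usable(platform: str = sys.platform) -> bool:
    """Return True if the vLLM backend is worth trying on this host.

    Hypothetical helper (not part of citeformer). It checks the OS and
    whether a `vllm` package can be found; it does not probe for CUDA.
    """
    # vLLM ships Linux wheels only, so reject other platforms up front.
    if not platform.startswith("linux"):
        return False
    # find_spec returns None when no top-level `vllm` package is installed.
    return importlib.util.find_spec("vllm") is not None


if __name__ == "__main__":
    print("vLLM backend usable:", vllm_backend_usable())
```

A test suite could use a check like this to skip vLLM integration tests on macOS and Windows instead of failing at import time.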
Module Contents
Classes
VLLMBackend | vLLM backend with grammar-level citation enforcement.
API
- class citeformer.backends.vllm.VLLMBackend(model: str, *, guided_decoding_backend: str = 'xgrammar', **llm_kwargs: Any)
Bases: citeformer.backends.base.Backend

vLLM backend with grammar-level citation enforcement.

Wraps vllm.LLM for offline batched generation. Uses XGrammar as the constrained-decoding backend by default; override via the guided_decoding_backend constructor kwarg ("llguidance" is the next-best choice for fast TTFT on simple grammars).

Attributes:
  model_name: HuggingFace model identifier.
  guided_decoding_backend: vLLM’s guided-decoding backend selector.
  llm: The loaded vllm.LLM instance.

Initialization

Load a model with vLLM.

Args:
  model: HuggingFace model identifier (or a local path vLLM can load).
  guided_decoding_backend: Constrained-decoding backend. Common choices: "xgrammar" (default), "llguidance", "outlines", "lm-format-enforcer".
  **llm_kwargs: Forwarded to vllm.LLM. Useful ones: dtype, tensor_parallel_size, gpu_memory_utilization, max_model_len, enforce_eager.

Raises:
  ImportError: If the citeformer[vllm] extras aren’t installed (or aren’t available on this platform; vLLM is Linux/CUDA only).

- generate(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) -> str
Generate text with vLLM + grammar-constrained decoding.

Args:
  prompt: User prompt. Caller assembles any RAG context.
  sources: Sources in scope (must be non-empty).
  policy: Citation enforcement policy.
  **options: Sampling + grammar overrides: max_new_tokens (default 256), temperature (default 0.7), max_content_chars (REQUIRED-policy progression bound; see ADR-009). Unknown keys are ignored.

Returns:
  Generated text with only valid [N] markers.
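
A usage sketch for the constructor and generate() as documented above. The model identifier is a placeholder. sources and policy are taken as parameters because the constructors of citeformer.core.Source and citeformer.core.Policy are not shown in this section. invalid_markers is a hypothetical helper (not part of citeformer) that double-checks the documented return guarantee, under the assumption that [N] markers are 1-based indices into sources.

```python
import re
from typing import Any, Sequence


def generate_with_vllm(prompt: str, sources: Sequence[Any], policy: Any) -> str:
    """Build a VLLMBackend and run one grammar-constrained generation."""
    # Imported lazily: the documented ImportError surfaces here on hosts
    # without the vllm extra (for example macOS or Windows).
    from citeformer.backends.vllm import VLLMBackend

    backend = VLLMBackend(
        "some-org/some-model",  # placeholder HuggingFace identifier
        guided_decoding_backend="xgrammar",  # documented default
        gpu_memory_utilization=0.8,  # forwarded to vllm.LLM via **llm_kwargs
    )
    return backend.generate(
        prompt,
        list(sources),
        policy,
        max_new_tokens=256,  # documented default
        temperature=0.7,  # documented default
    )


def invalid_markers(text: str, num_sources: int) -> list[int]:
    """Return cited indices in `text` that fall outside 1..num_sources.

    Assumption: [N] markers are 1-based indices into `sources`.
    """
    cited = (int(m) for m in re.findall(r"\[(\d+)\]", text))
    return sorted({n for n in cited if not 1 <= n <= num_sources})
```

With grammar-level enforcement, invalid_markers(output, len(sources)) would be expected to return an empty list; the check is useful mainly as a regression guard when switching guided_decoding_backend.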