citeformer.backends.vllm

vLLM backend with grammar-level citation enforcement.

vLLM supports multiple guided-decoding backends (xgrammar, outlines, lm-format-enforcer, llguidance). We pick XGrammar by default because (a) it’s vLLM’s default in 2026, and (b) it’s what our HF backend already uses, so a user running the same grammar through both backends gets identical decode-time semantics.

Requires the vllm extra: pip install citeformer[vllm]. Linux with CUDA only. vLLM doesn’t ship macOS or Windows wheels as of April 2026, so this backend is excluded from the all extra and from the integration tests that run on non-Linux hosts.
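Because the backend is Linux/CUDA only, callers that run on mixed platforms may want to check availability before importing it. The following is a minimal sketch using only the standard library. vllm_backend_available is a hypothetical helper, not part of citeformer, and it does not verify that a CUDA device is present.

```python
import importlib.util
import sys


def vllm_backend_available() -> bool:
    """Return True if this host could plausibly load the vLLM backend.

    Hypothetical helper, not part of citeformer. It checks only the
    platform and whether a `vllm` package is importable; it does not
    verify that a CUDA device is present.
    """
    # vLLM ships Linux wheels only, so rule out other platforms first.
    if not sys.platform.startswith("linux"):
        return False
    # find_spec returns None when the top-level package is not installed.
    return importlib.util.find_spec("vllm") is not None
```

A test suite could use a check like this to skip vLLM integration tests on non-Linux hosts.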

Module Contents

Classes

VLLMBackend

vLLM backend with grammar-level citation enforcement.

API

class citeformer.backends.vllm.VLLMBackend(model: str, *, guided_decoding_backend: str = 'xgrammar', **llm_kwargs: Any)

Bases: citeformer.backends.base.Backend

vLLM backend with grammar-level citation enforcement.

Wraps vllm.LLM for offline batched generation. Uses XGrammar as the constrained-decoding backend by default; override via the guided_decoding_backend constructor kwarg ("llguidance" is the next-best choice when a low time to first token (TTFT) matters on simple grammars).

Attributes:
    model_name: HuggingFace model identifier.
    guided_decoding_backend: vLLM’s guided-decoding backend selector.
    llm: The loaded vllm.LLM instance.

Initialization

Load a model with vLLM.

Args:
    model: HuggingFace model identifier (or a local path vLLM can load).
    guided_decoding_backend: Constrained-decoding backend. Common choices: "xgrammar" (default), "llguidance", "outlines", "lm-format-enforcer".
    **llm_kwargs: Forwarded to vllm.LLM. Useful ones: dtype, tensor_parallel_size, gpu_memory_utilization, max_model_len, enforce_eager.

Raises:
    ImportError: If the citeformer[vllm] extras aren’t installed, or aren’t available on this platform (vLLM is Linux/CUDA only).
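The constructor arguments and the ImportError behavior above can be combined into a guarded loader. This is a sketch under stated assumptions: try_load_vllm_backend is a hypothetical wrapper, the model identifier and kwarg values in the commented call are placeholders, and only the VLLMBackend signature is taken from this page.

```python
from typing import Any, Optional


def try_load_vllm_backend(model: str, **llm_kwargs: Any) -> Optional[Any]:
    """Build a VLLMBackend, or return None when the vllm extra is unavailable.

    Hypothetical wrapper for illustration; only the VLLMBackend
    signature comes from the documentation above.
    """
    try:
        # Import lazily so the calling module still imports on macOS/Windows.
        from citeformer.backends.vllm import VLLMBackend

        return VLLMBackend(
            model,
            guided_decoding_backend="xgrammar",  # documented default
            **llm_kwargs,  # forwarded to vllm.LLM
        )
    except ImportError:
        # Covers both a missing citeformer install and a missing vllm extra.
        return None


# Example call; the model id and kwarg values are placeholders.
# backend = try_load_vllm_backend(
#     "org/model-name",
#     dtype="bfloat16",
#     tensor_parallel_size=1,
#     gpu_memory_utilization=0.90,
#     max_model_len=8192,
# )
```

Returning None lets the caller fall back to another backend; re-raising with a clearer message would be an equally reasonable design.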

model_name: str

None

guided_decoding_backend: str

None

llm: Any

None

generate(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) → str

Generate text with vLLM + grammar-constrained decoding.

Args:
    prompt: User prompt. Caller assembles any RAG context.
    sources: Sources in scope (must be non-empty).
    policy: Citation enforcement policy.
    **options: Sampling and grammar overrides. Unknown keys are ignored.
        max_new_tokens: Default 256.
        temperature: Default 0.7.
        max_content_chars: REQUIRED-policy progression bound; see ADR-009.
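The option handling described above (documented defaults, unknown keys ignored) could look roughly like the sketch below. resolve_options and DEFAULTS are hypothetical names, not the backend's internals, and since no default is documented for max_content_chars, None stands in for "unset".

```python
from typing import Any

# Defaults as documented above. max_content_chars has no documented
# default, so None is used here as a placeholder for "unset".
DEFAULTS: dict[str, Any] = {
    "max_new_tokens": 256,
    "temperature": 0.7,
    "max_content_chars": None,
}


def resolve_options(**options: Any) -> dict[str, Any]:
    """Merge caller options over the defaults, dropping unknown keys."""
    resolved = dict(DEFAULTS)
    for key, value in options.items():
        if key in DEFAULTS:  # unknown keys are silently ignored
            resolved[key] = value
    return resolved


greedy = resolve_options(temperature=0.0, top_k=5)  # top_k is dropped
```

One consequence of ignoring unknown keys is that a misspelled option (for example max_tokens instead of max_new_tokens) would fall back to the default without any error.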

Returns: Generated text with only valid [N] markers.
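The grammar is meant to rule out invalid markers at decode time, so a post-hoc check should be redundant for this backend. A small checker can still be useful as a regression test. The sketch below assumes markers are 1-based indices into sources, which this page does not state; invalid_markers is a hypothetical helper.

```python
import re

# Matches citation markers of the form [N], where N is a decimal integer.
MARKER = re.compile(r"\[(\d+)\]")


def invalid_markers(text: str, num_sources: int) -> list[int]:
    """Return cited indices that fall outside 1..num_sources.

    Assumes 1-based [N] markers; the documentation does not state the base.
    """
    cited = [int(n) for n in MARKER.findall(text)]
    return [n for n in cited if not 1 <= n <= num_sources]
```

An empty result means every marker in the text points at a source that was in scope.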