`citeformer.backends.vllm`

vLLM backend with grammar-level citation enforcement.

vLLM supports multiple guided-decoding backends (xgrammar, outlines, lm-format-enforcer, llguidance). We pick XGrammar by default for two reasons: (a) it’s vLLM’s default as of 2026, and (b) it’s what our HF backend already uses, so a user running the same grammar through both backends gets identical decode-time semantics.

Requires the vllm extra: pip install citeformer[vllm]. Linux with CUDA only. vLLM doesn’t ship macOS or Windows wheels as of April 2026, so this backend is excluded from the all extra and from the integration tests that run on non-Linux hosts.
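Since the extra is Linux/CUDA only, code that runs on mixed platforms may want to check the host before selecting this backend. The following is a minimal sketch, not part of citeformer: the helper name `vllm_backend_available` is hypothetical, and it only checks the operating system and whether the `vllm` package can be found. It does not verify that a CUDA device is present.

```python
import importlib.util
import sys


def vllm_backend_available() -> bool:
    """Return True if this host could plausibly run the vLLM backend.

    Hypothetical helper (not a citeformer API). Checks the platform and
    whether the `vllm` package is importable; CUDA is NOT checked.
    """
    # vLLM ships Linux wheels only, so any other platform is ruled out early.
    if not sys.platform.startswith("linux"):
        return False
    # find_spec returns None when a top-level package is not installed.
    return importlib.util.find_spec("vllm") is not None


if __name__ == "__main__":
    print("vLLM backend usable:", vllm_backend_available())
```

A caller could use the result to fall back to another backend on macOS or Windows instead of failing at import time.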

Module Contents

Classes

VLLMBackend

vLLM backend with grammar-level citation enforcement.

API

class citeformer.backends.vllm.VLLMBackend(model: str, *, guided_decoding_backend: str = 'xgrammar', **llm_kwargs: Any)

Bases: citeformer.backends.base.Backend

vLLM backend with grammar-level citation enforcement.

Wraps vllm.LLM for offline batched generation. Uses XGrammar as the constrained-decoding backend by default; override it with the guided_decoding_backend constructor kwarg ("llguidance" is the next-best choice when fast time to first token (TTFT) on simple grammars matters).

Attributes:
    model_name: HuggingFace model identifier.
    guided_decoding_backend: vLLM’s guided-decoding backend selector.
    llm: The loaded vllm.LLM instance.

Initialization

Load a model with vLLM.

Args:
    model: HuggingFace model identifier (or a local path vLLM can load).
    guided_decoding_backend: Constrained-decoding backend. Common choices: "xgrammar" (default), "llguidance", "outlines", "lm-format-enforcer".
    **llm_kwargs: Forwarded to vllm.LLM. Useful ones: dtype, tensor_parallel_size, gpu_memory_utilization, max_model_len, enforce_eager.

Raises:
    ImportError: If the citeformer[vllm] extra isn’t installed, or isn’t available on this platform (vLLM is Linux/CUDA only).
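The constructor arguments above can be put together roughly as follows. This is a sketch under stated assumptions: the helper names `backend_kwargs` and `load_backend` are hypothetical, and the specific values (dtype, memory fraction, context length) are illustrative rather than recommended settings. The import is done lazily inside the function so that the file can still be imported on hosts where vLLM is unavailable.

```python
from typing import Any


def backend_kwargs(num_gpus: int = 1) -> dict[str, Any]:
    """Collect keyword arguments for VLLMBackend (illustrative values only)."""
    return {
        # Documented default; "llguidance", "outlines" and
        # "lm-format-enforcer" are the other documented choices.
        "guided_decoding_backend": "xgrammar",
        # Everything below is forwarded to vllm.LLM through **llm_kwargs.
        "dtype": "bfloat16",
        "tensor_parallel_size": num_gpus,
        "gpu_memory_utilization": 0.90,
        "max_model_len": 8192,
    }


def load_backend(model: str, **overrides: Any) -> Any:
    """Build a VLLMBackend; ImportError propagates if the extra is missing."""
    # Lazy import: only Linux/CUDA hosts with citeformer[vllm] get past this.
    from citeformer.backends.vllm import VLLMBackend

    kwargs = {**backend_kwargs(), **overrides}
    return VLLMBackend(model, **kwargs)
```

Calling `load_backend("<model-id>", enforce_eager=True)` would then pass `enforce_eager` through to vllm.LLM alongside the illustrative defaults.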

model_name: str = None

guided_decoding_backend: str = None

llm: Any = None

generate(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) → str

Generate text with vLLM + grammar-constrained decoding.

Args:
    prompt: User prompt. The caller assembles any RAG context.
    sources: Sources in scope (must be non-empty).
    policy: Citation enforcement policy.
    **options: Sampling and grammar overrides: max_new_tokens (default 256), temperature (default 0.7), and max_content_chars (REQUIRED-policy progression bound; see ADR-009). Unknown keys are ignored.
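The option handling described above (documented defaults, unknown keys ignored) can be pictured with a small sketch. This is not the backend's actual code: `resolve_options` and `DEFAULTS` are hypothetical names, and because no default is documented for max_content_chars, `None` is used here as an assumed "not set" value.

```python
from typing import Any

# Defaults for max_new_tokens and temperature come from the documentation.
# The None for max_content_chars is an assumption, not a documented default.
DEFAULTS: dict[str, Any] = {
    "max_new_tokens": 256,
    "temperature": 0.7,
    "max_content_chars": None,
}


def resolve_options(**options: Any) -> dict[str, Any]:
    """Merge caller options over the defaults, silently dropping unknown keys."""
    known = {key: value for key, value in options.items() if key in DEFAULTS}
    return {**DEFAULTS, **known}


if __name__ == "__main__":
    # top_k is not a documented option, so it is dropped.
    print(resolve_options(temperature=0.0, top_k=5))
    # {'max_new_tokens': 256, 'temperature': 0.0, 'max_content_chars': None}
```

One consequence of ignoring unknown keys is that a misspelled option name fails silently, so callers may want to double-check spelling when a setting appears to have no effect.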

Returns: Generated text with only valid [N] markers.
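Because the grammar is meant to rule out invalid markers at decode time, a post-hoc check is mainly useful in tests or as a sanity assertion. The sketch below assumes that [N] markers are 1-based indices into the sources list, which this page does not state explicitly; `invalid_markers` is a hypothetical helper, not a citeformer function.

```python
import re

# Matches citation markers such as [1] or [12].
MARKER = re.compile(r"\[(\d+)\]")


def invalid_markers(text: str, num_sources: int) -> list[int]:
    """Return cited indices outside 1..num_sources (1-based indexing is assumed)."""
    cited = (int(match) for match in MARKER.findall(text))
    return [n for n in cited if not 1 <= n <= num_sources]


if __name__ == "__main__":
    # With two sources in scope, [3] is out of range.
    print(invalid_markers("Water boils at 100 C [1], per [3].", num_sources=2))
    # [3]
```

An empty list from `invalid_markers(output, len(sources))` would be consistent with the return contract described above.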
