citeformer.backends.base
Abstract backend interface.
Every concrete backend (HFBackend, VLLMBackend, LlamaCppBackend, MockBackend)
implements this ABC. The orchestration layer (Citeformer) is backend-agnostic and
delegates generation via this interface, keeping grammar-building and decoding logic
scoped to the backend that cares about them.
Async surface (ADR-014): Backend exposes agenerate() and astream() alongside
generate() / stream(). Concrete backends inherit asyncio.to_thread defaults
that delegate to the sync methods — every backend works in async code without any
override. API backends with native async clients (OpenAIBackend.agenerate uses
AsyncOpenAI, AnthropicBackend.agenerate uses AsyncAnthropic) override for
genuine concurrency under load.
Module Contents
Classes
Backend | Abstract backend for citeformer.
API
- class citeformer.backends.base.Backend
Bases: abc.ABC
Abstract backend for citeformer.
Subclasses implement generate() against a specific model runtime (HF transformers, vLLM, llama.cpp, plus the MockBackend). Constrained-decoding grammar construction is the backend’s responsibility: each runtime has a different native format (XGrammar object, GBNF string, etc.), so the shared grammar/builder.py module emits a backend-agnostic intermediate representation that each backend converts as needed.
Subclasses may optionally override stream() to yield chunks as the model decodes them. The default implementation falls back to generate() and emits the full text as a single chunk, so any backend works with Citeformer.stream(), but only overriding backends deliver true token-by-token behavior.
Async surface (ADR-014): agenerate() and astream() default to running the sync methods via asyncio.to_thread. Backends with native async clients (OpenAIBackend, AnthropicBackend, and their subclasses) override these for genuine concurrency.
- abstractmethod generate(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) → str
Generate text with citation markers constrained to the given sources.
The returned string contains inline [N] markers where N is a 1-indexed position into sources. On grammar-level-enforcing backends (HF, vLLM, llama.cpp via their constrained-decoding integration), emitting an [N] for N > len(sources) is token-impossible; MockBackend in tests just respects the contract by construction.
Args:
  prompt: User prompt. The orchestration layer is responsible for constructing a retrieval-augmented prompt (stitching in source snippets); this method receives the final prompt string.
  sources: Sources in scope; position determines citation index.
  policy: Citation enforcement policy.
  **options: Backend-specific decoding options (e.g. max_tokens, temperature, seed). Unknown options are silently ignored.
Returns: The generated text with inline markers. References are not part of the backend output; the orchestration layer renders them separately.
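The generate() contract above can be illustrated with a small, self-contained sketch. Everything named here is a hypothetical stand-in, not citeformer's actual code: `Source` is a placeholder for citeformer.core.Source, `SketchBackend` only mirrors the documented signature, `EchoBackend` is a toy subclass, and `markers_in_range` is an illustrative helper for checking the 1-indexed marker rule.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class Source:
    """Hypothetical stand-in for citeformer.core.Source."""
    text: str


class SketchBackend(ABC):
    """Hypothetical stand-in mirroring the documented generate() signature."""

    @abstractmethod
    def generate(self, prompt: str, sources: list[Source], policy: Any, **options: Any) -> str:
        ...


class EchoBackend(SketchBackend):
    """Toy backend: cites each source once, so markers are valid by construction."""

    def generate(self, prompt: str, sources: list[Source], policy: Any, **options: Any) -> str:
        # Options this toy backend does not understand are simply ignored.
        return " ".join(f"{s.text} [{i}]" for i, s in enumerate(sources, start=1))


def markers_in_range(text: str, n_sources: int) -> bool:
    """Return True if every [N] marker in text satisfies 1 <= N <= n_sources."""
    return all(1 <= int(m) <= n_sources for m in re.findall(r"\[(\d+)\]", text))
```

A grammar-enforcing backend would make an out-of-range marker impossible at decode time; a check like `markers_in_range` is only useful for backends or tests that rely on the contract instead of a grammar.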
- stream(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) → collections.abc.Iterator[str]
Yield generation output as a stream of text chunks.
The default implementation is not a true stream: it calls generate() and yields the full text as a single chunk. Backends that can produce token-by-token output (HF, llama.cpp) should override this to yield chunks as they’re decoded; the grammar constraints apply to every yielded chunk exactly as in generate().
Args:
  prompt: See generate().
  sources: See generate().
  policy: See generate().
  **options: See generate(). Most backends also accept streaming-specific hints here (e.g. XGrammar’s LogitsProcessor is already stateful, so no extra plumbing is needed).
Yields: Text chunks in the order they’re produced. Joining all yielded chunks reconstructs what generate() would have returned.
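As a rough sketch of the two behaviours described for stream(), the classes below are hypothetical and do not come from citeformer. `FallbackBackend` imitates the documented single-chunk default, and `WordStreamingBackend` shows what an override might look like. It splits finished text purely for illustration, whereas a real backend would yield chunks as tokens are decoded.

```python
from collections.abc import Iterator
from typing import Any


class FallbackBackend:
    """Hypothetical backend that relies on the single-chunk default."""

    def generate(self, prompt: str, sources: list, policy: Any, **options: Any) -> str:
        return "Water boils at 100 C [1]."

    def stream(self, prompt: str, sources: list, policy: Any, **options: Any) -> Iterator[str]:
        # Mirrors the documented default: one chunk holding the full text.
        yield self.generate(prompt, sources, policy, **options)


class WordStreamingBackend(FallbackBackend):
    """Hypothetical override that yields whitespace-delimited pieces."""

    def stream(self, prompt: str, sources: list, policy: Any, **options: Any) -> Iterator[str]:
        text = self.generate(prompt, sources, policy, **options)
        words = text.split(" ")
        for i, word in enumerate(words):
            # Re-attach the separator so joining the chunks restores the text.
            yield word if i == len(words) - 1 else word + " "
```

In both cases `"".join(backend.stream(...))` equals what `generate()` returns, which is the invariant the Yields section states.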
- async agenerate(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) → str
Async counterpart of generate(). See ADR-014.
The default implementation runs generate() on a worker thread via asyncio.to_thread. This is correct on every backend but only frees the event loop while the SDK call is in flight. Backends with native async clients (OpenAIBackend, AnthropicBackend, and their subclasses) override this to use AsyncOpenAI / AsyncAnthropic for genuine concurrency under load.
Args:
  prompt: See generate().
  sources: See generate().
  policy: See generate().
  **options: See generate().
Returns: Same as generate().
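The thread-delegating default might look roughly like the sketch below. This is an assumption about its shape based on the description above, not the library's source. `SlowSyncBackend` and `run_many` are hypothetical names, and `time.sleep` stands in for a blocking model or SDK call.

```python
import asyncio
import time
from typing import Any


class SlowSyncBackend:
    """Hypothetical backend whose generate() blocks, like a sync SDK call."""

    def generate(self, prompt: str, sources: list, policy: Any, **options: Any) -> str:
        time.sleep(0.05)  # stands in for a blocking model or network call
        return f"answer to {prompt} [1]"

    async def agenerate(self, prompt: str, sources: list, policy: Any, **options: Any) -> str:
        # Assumed shape of the default: hand the sync call to a worker thread
        # so the event loop stays free while it runs.
        return await asyncio.to_thread(self.generate, prompt, sources, policy, **options)


async def run_many(backend: SlowSyncBackend, prompts: list[str]) -> list[str]:
    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(backend.agenerate(p, [], None) for p in prompts))
```

Calling `asyncio.run(run_many(SlowSyncBackend(), ["a", "b"]))` runs the blocking calls on separate worker threads. Throughput is then limited by the thread pool, which is why backends with native async clients override the method.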
- async astream(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) → collections.abc.AsyncIterator[str]
Async counterpart of stream(). See ADR-014.
The default implementation wraps the sync iterator from stream() in asyncio.to_thread per chunk. It yields control back to the event loop between chunks but is bounded by the sync generator’s blocking behaviour. Backends with native async streaming (OpenAIBackend, AnthropicBackend) override this for genuine concurrency.
Yields: Same as stream(), just async.
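One way to wrap a sync iterator per chunk is sketched below, under the assumption that each `next()` call is dispatched to a worker thread. The actual default may differ in detail. `ChunkedSyncBackend`, `collect`, and the `_DONE` sentinel are hypothetical names introduced for illustration.

```python
import asyncio
from collections.abc import AsyncIterator, Iterator
from typing import Any

# Sentinel returned by next() on exhaustion, because a StopIteration raised
# in the worker thread cannot be propagated through the awaited future.
_DONE = object()


class ChunkedSyncBackend:
    """Hypothetical backend with a sync stream() and a thread-wrapped astream()."""

    def stream(self, prompt: str, sources: list, policy: Any, **options: Any) -> Iterator[str]:
        yield from ["Paris is ", "the capital ", "of France [1]."]

    async def astream(self, prompt: str, sources: list, policy: Any, **options: Any) -> AsyncIterator[str]:
        iterator = iter(self.stream(prompt, sources, policy, **options))
        while True:
            # Each next() call runs on a worker thread; the event loop can
            # schedule other tasks while the sync generator blocks.
            chunk = await asyncio.to_thread(next, iterator, _DONE)
            if chunk is _DONE:
                break
            yield chunk


async def collect(backend: ChunkedSyncBackend) -> list[str]:
    return [chunk async for chunk in backend.astream("q", [], None)]
```

Chunks arrive in the same order as from `stream()`, so `asyncio.run(collect(ChunkedSyncBackend()))` returns the same list a sync caller would see.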