citeformer.backends.openai

OpenAI backend — schema-level cite-id enforcement via Structured Outputs.

OpenAI’s response_format={"type": "json_schema", "json_schema": {"strict": true, ...}} API lets us hand the model a JSON schema where the source_id field is constrained to a literal enum of the in-scope source indices. When strict is true the API rejects any response whose source_id isn’t one of the enumerated integers. At the schema layer, this is structurally equivalent to what XGrammar does at the logit layer for local backends.
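A minimal sketch of how such a response_format could be assembled. The helper name build_response_format and the schema name cited_answer are hypothetical, and the field names follow the segments/citations shape documented for OpenAIBackend further down; the backend's actual schema construction may differ.

```python
from typing import Any


def build_response_format(n_sources: int) -> dict[str, Any]:
    """Hypothetical helper: strict JSON-schema response_format for n_sources sources."""
    if n_sources < 1:
        raise ValueError("at least one source is required")
    cite_ids = list(range(1, n_sources + 1))  # 1-indexed source positions
    schema = {
        "type": "object",
        "properties": {
            "segments": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "text": {"type": "string"},
                        "citations": {
                            "type": "array",
                            # Only in-scope ids are representable in the schema.
                            "items": {"type": "integer", "enum": cite_ids},
                        },
                    },
                    # Strict mode expects every property to be required and
                    # additionalProperties to be false.
                    "required": ["text", "citations"],
                    "additionalProperties": False,
                },
            }
        },
        "required": ["segments"],
        "additionalProperties": False,
    }
    return {
        "type": "json_schema",
        # "cited_answer" is an illustrative schema name, not the library's.
        "json_schema": {"name": "cited_answer", "strict": True, "schema": schema},
    }
```

With three sources in scope, the citations items carry enum [1, 2, 3], so an id of 4 cannot appear in a schema-valid payload.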

Tier honesty:

  • Local backends (HF / vLLM / llama.cpp) enforce at the logit layer — a fabricated cite id is token-impossible to sample.

  • This backend enforces at the schema layer. The provider validates the assistant’s generation against the schema before returning it. Fabrication is structurally impossible in the returned payload, which is what matters for downstream consumers.

Requires the openai extra: pip install citeformer[openai].

Model requirements: the strict: true JSON-schema mode is supported on gpt-4o-2024-08-06, gpt-4o-mini, and every OpenAI model released after August 2024. Older models will return a 400 — we surface that directly rather than silently falling back to non-strict mode.

Module Contents

Classes

OpenAIBackend

OpenAI chat-completions backend with schema-level cite enforcement.

API

class citeformer.backends.openai.OpenAIBackend(model: str = _DEFAULT_MODEL, *, client: Any | None = None, async_client: Any | None = None, **client_kwargs: Any)

Bases: citeformer.backends.base.Backend

OpenAI chat-completions backend with schema-level cite enforcement.

Requests a structured response of the shape:

{
  "segments": [
    {"text": "A sentence.", "citations": [1, 2]},
    {"text": "Another one.", "citations": [3]}
  ]
}

where every citations[*] integer is enum-constrained to 1..N. The backend then flattens the segments back into a single string carrying the configured citeformer.core.MarkerStyle markers (default [N]), so downstream code (Citeformer orchestrator, verify, render) sees the same shape as local-backend output.
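The flattening step can be sketched roughly as follows. The helper flatten_segments and its marker template argument are hypothetical, and placing the markers after each segment's text, separated by a space, is an assumption; the real backend may position markers differently.

```python
import json


def flatten_segments(payload: str, marker: str = "[{n}]") -> str:
    """Hypothetical helper: join segments into one string with inline markers."""
    data = json.loads(payload)
    parts = []
    for segment in data["segments"]:
        text = segment["text"]
        # One marker per cited source id, in the order the model listed them.
        markers = "".join(marker.format(n=n) for n in segment["citations"])
        # Assumption: markers trail the segment text; uncited segments pass through.
        parts.append(f"{text} {markers}" if markers else text)
    return " ".join(parts)
```

Applied to the example payload above, this sketch yields "A sentence. [1][2] Another one. [3]".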

Attributes:

  • model: OpenAI model identifier (e.g. "gpt-4o-mini").

  • client: The authenticated openai.OpenAI client.

  • last_usage: Token-usage payload from the most recent generate() call. None before the first call. The orchestrator threads this onto GenerationResult.usage.

Initialization

Create an OpenAI backend.

Args:

  • model: Model id supporting strict JSON schema (gpt-4o-mini or later).

  • client: Pre-built openai.OpenAI client used by generate() / stream(). If None, one is built lazily from client_kwargs on the first sync call (and picks up OPENAI_API_KEY from the environment at that point).

  • async_client: Pre-built openai.AsyncOpenAI client used by agenerate() / astream() (ADR-014). If None, one is built lazily from client_kwargs on the first async call. Sync-only callers don’t pay the async construction cost; async-only callers don’t pay the sync one.

  • **client_kwargs: Forwarded to openai.OpenAI() / openai.AsyncOpenAI() when the respective client is None (base_url, api_key, organization, timeout, …). Useful for pointing at a compatible endpoint (Azure, local LiteLLM, Together, Anyscale).

model: str

None

last_usage: citeformer.core.TokenUsage | None

None

property client: Any

Lazy openai.OpenAI client used by the sync surface.

Built on first access from client_kwargs (or returns the constructor-supplied override). Async-only callers never trigger construction — important for tests that inject only an async_client without setting OPENAI_API_KEY.

property async_client: Any

Lazy openai.AsyncOpenAI client used by the async surface.

Built on first access from the same client_kwargs the sync client uses (so a backend pointing at base_url=... for OpenRouter / Fireworks / Together cascades correctly to the async client too). Sync-only callers never trigger construction.

generate(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) -> str

Generate text with schema-level citation constraint.

Args:

  • prompt: User prompt; the caller is responsible for RAG stitching.

  • sources: Sources in scope; position (1-indexed) becomes the cite enum entry.

  • policy: Citation policy (REQUIRED/AUTO/QUOTES_ONLY). Threaded into the schema’s description so the model sees the same enforcement intent it would under local decoding.

  • **options: max_tokens (default 1024), temperature (default 0.7), marker_style (default BRACKET), system_prompt (additional system-role content prepended to the assembled citation instructions).

Returns: Flattened text carrying marker_style markers for every cited source, in document order.

Raises: ValueError: If sources is empty.

async agenerate(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) -> str

Native-async counterpart of generate() (ADR-014).

Uses self.async_client (the lazy AsyncOpenAI) so concurrent callers don’t tie up executor threads on the SDK’s HTTP wait. The request shape, schema construction, segment flattening, and last_usage extraction are identical to the sync path — only the client call is awaited. Subclasses (OpenRouter / Fireworks / Together) inherit this unchanged; their _build_response_format / _augment_create_kwargs hooks fire from here just like in the sync path.

stream(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) -> collections.abc.Iterator[str]

Stream segments by yielding each finished sentence with its markers.

OpenAI’s streaming response-format path is richer than we need here: we’re not trying to surface per-token deltas, just complete sentence-level chunks as each segment is validated. The simpler implementation calls generate() once and chunks its output on sentence boundaries, so downstream consumers of Citeformer.stream still see multiple chunks.
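A rough sketch of sentence-boundary chunking over already-flattened text. The function chunk_sentences and its regex are hypothetical, and the sketch assumes bracket markers that trail their sentence (as in "A sentence. [1][2]"), so a boundary is whitespace that follows sentence punctuation or a closing marker and does not precede another marker.

```python
import re
from collections.abc import Iterator

# Split on whitespace that follows ., !, ? or a closing "]" and is not
# followed by "[" (a trailing marker) or further whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?\]])\s+(?![\[\s])")


def chunk_sentences(text: str) -> Iterator[str]:
    """Hypothetical helper: yield each sentence together with its markers."""
    for chunk in _BOUNDARY.split(text):
        if chunk:
            yield chunk
```

For example, list(chunk_sentences("A sentence. [1][2] Another one. [3]")) gives two chunks, each keeping its own markers. The whitespace between chunks is consumed by the split, so a consumer that rejoins chunks would need to add its own separator.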

async astream(prompt: str, sources: list[citeformer.core.Source], policy: citeformer.core.Policy, **options: Any) -> collections.abc.AsyncIterator[str]

Native-async counterpart of stream() (ADR-014).

Awaits agenerate() (uses the async client) and then yields the same sentence-chunked output the sync stream() produces. Cascades to OpenRouter / Fireworks / Together since they don’t override stream either.
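The await-then-chunk pattern can be sketched as an async generator. Everything here is illustrative: astream_sketch, fake_agenerate, and collect are hypothetical names, the fake coroutine stands in for the real agenerate call, and the boundary regex assumes bracket markers that trail their sentence.

```python
import asyncio
import re
from collections.abc import AsyncIterator, Awaitable, Callable

# Assumed boundary rule: whitespace after ., !, ? or "]" that does not
# precede a marker or more whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?\]])\s+(?![\[\s])")


async def astream_sketch(
    agenerate: Callable[[str], Awaitable[str]], prompt: str
) -> AsyncIterator[str]:
    """Await the full generation once, then yield sentence-level chunks."""
    text = await agenerate(prompt)
    for chunk in _BOUNDARY.split(text):
        if chunk:
            yield chunk


async def fake_agenerate(prompt: str) -> str:
    # Stand-in for the backend call; returns already-flattened text.
    await asyncio.sleep(0)
    return "A sentence. [1][2] Another one. [3]"


async def collect(prompt: str) -> list[str]:
    return [chunk async for chunk in astream_sketch(fake_agenerate, prompt)]
```

Running asyncio.run(collect("question")) returns the two sentence chunks in order. Because the whole response is awaited before the first chunk is yielded, this pattern does not reduce time to first chunk; it only preserves the multi-chunk interface.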