ADR-014: Async surface (agenerate / astream) on Backend + Citeformer
Status: Accepted and implemented (2026-04-25).
Context
LangChain (FastAPI middleware), LlamaIndex (async_query), and most
modern RAG framework code are async-first. Wrapping every citeformer
call in asyncio.to_thread works but is awkward. Worse, callers
can’t compose multiple concurrent generations cleanly, because the sync
SDK clients underneath block the executor thread for the whole request
duration.
Two ways to ship async:
1. Add async def parallels alongside the sync surface: the bigger codebase, but lets backends with native async clients (OpenAI’s AsyncOpenAI, Anthropic’s AsyncAnthropic, Mistral’s AsyncMistral) actually scale concurrent calls without thread exhaustion.
2. Provide a sync-only library and tell users to wrap with asyncio.to_thread: smaller surface, fine for occasional calls, terrible for high-concurrency RAG pipelines.
The local backends (HFBackend, VLLMBackend, LlamaCppBackend) are
GPU-bound — there’s no concurrency win from async there; they’d just
get the to_thread fallback. The API backends are where the real win
lives.
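The difference between the two options can be sketched with stand-in functions. Everything here is hypothetical: sync_generate and native_generate are placeholders for a blocking SDK call and a native async SDK call, not citeformer APIs. The sketch records which thread each call runs on, to show that the to_thread path occupies worker threads while the native path stays on the event-loop thread.

```python
import asyncio
import threading
import time


def sync_generate(prompt: str) -> tuple[str, int]:
    """Hypothetical blocking SDK call; returns the text and the thread it ran on."""
    time.sleep(0.05)  # stands in for a blocking HTTP round trip
    return f"answer to {prompt}", threading.get_ident()


async def native_generate(prompt: str) -> tuple[str, int]:
    """Hypothetical native async SDK call; yields to the loop while waiting."""
    await asyncio.sleep(0.05)  # stands in for non-blocking network I/O
    return f"answer to {prompt}", threading.get_ident()


async def compare(n: int = 4) -> tuple[set[int], set[int]]:
    prompts = [f"q{i}" for i in range(n)]
    # Option 2: each call holds a worker thread for the whole request.
    wrapped = await asyncio.gather(
        *(asyncio.to_thread(sync_generate, p) for p in prompts)
    )
    # Option 1: all calls share the single event-loop thread.
    native = await asyncio.gather(*(native_generate(p) for p in prompts))
    return {tid for _, tid in wrapped}, {tid for _, tid in native}


if __name__ == "__main__":
    wrapped_threads, native_threads = asyncio.run(compare())
    print("to_thread worker threads used:", len(wrapped_threads))
    print("native async threads used:", len(native_threads))
```

With many concurrent requests, the to_thread path is limited by the size of the default executor, which is the thread-exhaustion concern raised above.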
Decision
Adopt option (1): add a parallel async surface end-to-end.
- Backend ABC gains agenerate() and astream() with sensible defaults: both delegate to the sync methods via asyncio.to_thread, so every existing backend works in async code without modification. Backends with native async clients override for genuine concurrency.
- Citeformer orchestrator gains agenerate() (async, returns GenerationResult) and astream() (sync call returning a new AsyncStreamingResult that is async-iterable and has await stream.finalize()). Same parsing / rendering / usage / rich-citation threading as the sync path; the orchestrator is mostly the same code awaiting the backend instead of calling it.
- Native overrides land in this PR for OpenAIBackend and AnthropicBackend, the two most-used API backends. Both lazy-build their async client on first async call (so existing sync-only users don’t pay the AsyncOpenAI() instantiation cost). The async client is constructed from the same client_kwargs the sync client used: one configuration, two clients.
- OpenRouterBackend, FireworksBackend, TogetherBackend inherit OpenAIBackend.agenerate() / astream() for free: they subclass without touching the async path, so the cascade pulls them along.
- GeminiBackend and MistralBackend use the to_thread default for now. Their SDKs both support async (google-genai async mode, mistralai.AsyncMistral). This is flagged as a follow-up TODO in their module docstrings; it is not load-bearing because most users with high-concurrency needs reach for OpenAI/Anthropic anyway.
- Local backends (HFBackend, VLLMBackend, LlamaCppBackend) keep the to_thread default forever. GPU-bound; concurrent generation contends for the same hardware. Async is for I/O concurrency, not compute concurrency.
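A minimal sketch of how the default delegation and a native override might fit together. The method signatures (a single prompt string in, a string or string chunks out) are simplifying assumptions, and FakeAsyncClient, EchoBackend, and NativeBackend are hypothetical stand-ins rather than citeformer classes. Only the names Backend, agenerate, astream, client_kwargs, and async_client come from the decision above.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, AsyncIterator, Iterator, Optional

_DONE = object()  # sentinel marking the end of the sync iterator


class Backend(ABC):
    """Sketch: sync methods are abstract, async methods have concrete defaults."""

    @abstractmethod
    def generate(self, prompt: str) -> str: ...

    @abstractmethod
    def stream(self, prompt: str) -> Iterator[str]: ...

    async def agenerate(self, prompt: str) -> str:
        # Default: run the blocking call on a worker thread.
        return await asyncio.to_thread(self.generate, prompt)

    async def astream(self, prompt: str) -> AsyncIterator[str]:
        # Default: create and advance the sync iterator on worker threads,
        # one chunk per hop, so the event loop is never blocked.
        chunks = await asyncio.to_thread(lambda: iter(self.stream(prompt)))
        while (chunk := await asyncio.to_thread(next, chunks, _DONE)) is not _DONE:
            yield chunk


class EchoBackend(Backend):
    """Sync-only backend: gets both async methods without modification."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()

    def stream(self, prompt: str) -> Iterator[str]:
        yield from prompt.split()


class FakeAsyncClient:
    """Hypothetical stand-in for an SDK's native async client."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs

    async def complete(self, prompt: str) -> str:
        await asyncio.sleep(0)  # placeholder for non-blocking network I/O
        return f"native:{prompt}"


class NativeBackend(Backend):
    """Backend that overrides agenerate() with a lazily built async client."""

    def __init__(self, **client_kwargs: Any) -> None:
        self.client_kwargs = client_kwargs
        self._async_client: Optional[FakeAsyncClient] = None

    @property
    def async_client(self) -> FakeAsyncClient:
        # Built on first access from the same kwargs the sync client would use.
        if self._async_client is None:
            self._async_client = FakeAsyncClient(**self.client_kwargs)
        return self._async_client

    def generate(self, prompt: str) -> str:
        return f"sync:{prompt}"

    def stream(self, prompt: str) -> Iterator[str]:
        yield from prompt.split()

    async def agenerate(self, prompt: str) -> str:
        return await self.async_client.complete(prompt)
```

In this sketch a sync-only user of NativeBackend never triggers construction of the async client, which mirrors the lazy-build behavior described for OpenAIBackend and AnthropicBackend.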
Consequences
- The Backend ABC grows two methods. Out-of-tree backends written against the v0.1 ABC still work: both new methods have concrete defaults, no abstract requirement.
- AsyncStreamingResult is a new public class symmetric with StreamingResult. Same surface (text property, finalize(), iterable) but async: __aiter__ / __anext__ instead of __iter__ / __next__, and finalize() is async def.
- OpenAIBackend.async_client becomes a public attribute (lazy property) so subclasses and advanced users can introspect it.
- No §10 contract changed. The new methods produce identical GenerationResult shape to the sync methods.
- Adds zero new test-time dependencies. pytest-asyncio is already in the dev extras (was already used for the few existing async tests), and asyncio_mode = "auto" in pyproject.toml means async tests don’t need decoration.
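The AsyncStreamingResult surface described above could look roughly like the following. This is a sketch under assumptions: the constructor argument and the return value of finalize() are simplified (it returns the accumulated text here, where the real class would presumably return a GenerationResult), and fake_backend_stream is a hypothetical chunk source.

```python
import asyncio
from typing import AsyncIterator, List


class AsyncStreamingResult:
    """Sketch: async-iterable stream that accumulates text as it is consumed."""

    def __init__(self, chunks: AsyncIterator[str]) -> None:
        self._chunks = chunks
        self._parts: List[str] = []

    @property
    def text(self) -> str:
        # Everything received so far.
        return "".join(self._parts)

    def __aiter__(self) -> "AsyncStreamingResult":
        return self

    async def __anext__(self) -> str:
        # Propagates StopAsyncIteration when the underlying stream ends.
        chunk = await self._chunks.__anext__()
        self._parts.append(chunk)
        return chunk

    async def finalize(self) -> str:
        # Drain whatever the caller has not consumed yet, then return the text.
        async for _ in self:
            pass
        return self.text


async def fake_backend_stream() -> AsyncIterator[str]:
    """Hypothetical chunk source standing in for a backend's astream()."""
    for piece in ["Async ", "is for ", "I/O."]:
        await asyncio.sleep(0)
        yield piece


async def demo() -> tuple[list[str], str]:
    stream = AsyncStreamingResult(fake_backend_stream())
    seen = []
    async for chunk in stream:
        seen.append(chunk)
        break  # stop early; finalize() picks up the rest
    final = await stream.finalize()
    return seen, final


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Constructing the object is synchronous and no I/O happens until iteration starts, which matches astream() being a sync call that returns an async-iterable.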
What this does NOT include
- Native async for Gemini / Mistral: easy follow-ups, but separate PRs to keep this one focused.
- Async metadata fetchers (fetch_crossref etc.): they’re already cached and called once per source, low-leverage.
- Async verify(): NLI inference is GPU-bound; same argument as the local backends.
Why not async-everywhere with a sync fallback?
Considered: drop the sync surface entirely, recommend
asyncio.run(cf.agenerate(...)) for sync callers. Rejected because:
- Most existing RAG code is sync; the breaking-change cost is large.
- asyncio.run called from a sync function inside a running event loop (Jupyter, some test runners) raises RuntimeError; sync callers expect a sync surface.
- The dual surface is ~80 lines of orchestrator plus per-backend code, much cheaper than the migration cost.
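The asyncio.run failure mode mentioned above can be reproduced in a few lines. The agenerate function here is a hypothetical stand-in for an async-only entry point, and sync_wrapper is the kind of facade sync callers would have been told to write.

```python
import asyncio


async def agenerate(prompt: str) -> str:
    """Hypothetical async-only entry point."""
    await asyncio.sleep(0)
    return prompt.upper()


def sync_wrapper(prompt: str) -> str:
    """Sync facade built on asyncio.run; fine only when no loop is running."""
    coro = agenerate(prompt)
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "coroutine was never awaited" warning
        raise


async def caller_inside_loop() -> str:
    # Simulates Jupyter or an async test runner: a loop is already running
    # when the sync facade is called.
    try:
        sync_wrapper("hello")
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"
    return "no error"


if __name__ == "__main__":
    print(sync_wrapper("hello"))              # plain script: works
    print(asyncio.run(caller_inside_loop()))  # inside a loop: fails
```

A real sync surface has no such restriction, since it never needs to start an event loop of its own.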
The sync surface is the right default; the async surface is the right addition.