Architecture

citeformer is deliberately thin. The hard technical work — token masking, CSL rendering, PDF extraction, NLI — already lives in well-maintained dependencies. Our job is to compose them behind a single honest API.

Six-layer dependency order

CLI → orchestration (Citeformer) → verify → render → backends → grammar → core

Upper layers depend only on lower ones. A render module must never import from backends; a backend must never reach up into orchestration. Break this rule and the refactor radius explodes.
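
The rule above lends itself to a mechanical check. The following is a minimal sketch, not the project's actual tooling: the layer names are lowercased from the dependency line, while `import_allowed`, `violations`, and `EXTRA_BANS` are hypothetical names introduced for illustration. The render-to-backends ban is listed separately because the ordering alone would not forbid it.

```python
# Layer names copied from the dependency line above, highest layer first.
LAYERS = ["cli", "orchestration", "verify", "render", "backends", "grammar", "core"]
RANK = {name: index for index, name in enumerate(LAYERS)}

# Extra ban stated in the prose; it is stricter than the ordering alone.
EXTRA_BANS = {("render", "backends")}


def import_allowed(importer: str, imported: str) -> bool:
    """Return True if a module in `importer` may import from `imported`."""
    if (importer, imported) in EXTRA_BANS:
        return False
    # Same-layer imports are fine; otherwise the target must sit lower.
    return RANK[imported] >= RANK[importer]


def violations(edges):
    """Filter (importer, imported) layer pairs down to the forbidden ones."""
    return [edge for edge in edges if not import_allowed(*edge)]
```

A CI test could feed `violations` the layer pairs extracted from the package's import graph and fail when the returned list is non-empty.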

Piggyback-first

Before writing new code, ask: is this already done by one of these?

We piggyback on	For
XGrammar / llguidance	Grammar-level token masking at generation time
transformers (HF)	Running local causal LMs
vLLM	High-throughput inference with `--guided-decoding-backend`
llama.cpp (`llama-cpp-python`)	CPU / Apple Silicon inference with GBNF grammars
openai / anthropic / google-genai / mistralai	API-provider generation clients (the `openai` SDK is also the wire client for OpenRouter)
lark	Authoring the citation grammar before handing off to the decoder
httpx + diskcache	Metadata fetchers (Crossref, arXiv) with polite caching
pypdf / grobid-client-python	PDF text extraction — pypdf default, GROBID opt-in for cleaner scientific-paper parsing
readability-lxml	URL extraction
DeBERTa-v3-MNLI (via transformers)	NLI entailment for `verify()`
pydantic + typer + rich	Types, CLI, pretty output

The parts citeformer owns are the glue plus the render layer: the citation grammar shape (§10.1), the CSL-JSON source metadata contract (§10.2), the output pydantic models (§10.3), the inline-marker-to-reference coupling, the orchestration loop, and the six hand-written CSL formatters (APA 7, MLA 9, Chicago author-date, IEEE, Nature, Vancouver; see ADR-004). Everything else is composition.

Phase plan

v0.1.0 shipped on 2026-04-24. Each phase was a mergeable milestone with its own exit criterion; see the frozen genesis at docs/spec/v0.md for the original plan.

Phase	Scope	Exit criterion
P0	Scaffolding: pyproject, CI, docs skeleton, .claude/	`make lint && make test && make docs-build` green; v0.0.1 publishes to TestPyPI
P1	Core types: `Source`, `Citation`, `Reference`, `GenerationResult`, `Policy`, `Backend` ABC	Contracts locked; mock backend works end-to-end
P2	HF backend with grammar-level logit enforcement (the flagship)	Smoke test: given N sources, model cannot emit `[N+k]` for any `k > 0`, across 100+ prompts
P3	Deterministic CSL reference rendering (home-grown, see ADR-004)	APA 7, MLA 9, Chicago author-date, IEEE, Nature, Vancouver render cleanly on the fixture set
P4	Metadata adapters: DOI, arXiv, PDF, URL, BibTeX, Zotero	VCR-backed CI tests plus a live smoke script
P5	vLLM and llama.cpp backends	All three local backends pass the same conformance suite
P6	NLI verification + hand-curated AI-papers benchmark	Coverage report shows support-rate gains; see `benchmarks/README.md`
Polish	REQUIRED progression fix (ADR-009), real CLI, examples as living reports	ADR-009 integration test passes; `citeformer` CLI covers generate/verify/render; `examples/` has runnable scripts with findings READMEs
Expansion	Marker-shape enum (ADR-011), OpenAI + Anthropic + Gemini + Mistral API backends, threshold calibration, multi-prompt + ALCE benchmarks, literature-review notebook, HF Space demo, GROBID PDF extractor	Seven backends pass a shared contract; 40-run multi-prompt sweep reports 0.0 ± 0.0 fabrication; PREPRINT.md describes the v0.1 design + evaluation
P7 (shipped)	v0.1.0 on PyPI + GitHub Release	`pip install citeformer==0.1.0` works; docs built on RTD; CI green across Python 3.11–3.14
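
The P2 exit criterion can be phrased as a post-hoc check on generated text. This is only a sketch of such a check, assuming the bracketed-integer marker shape (`[3]`) and 1-based ids; `fabricated_ids` is a hypothetical helper, not a function from the citeformer test suite.

```python
import re

# Assumed marker shape: a bracketed non-negative integer such as [3].
MARKER = re.compile(r"\[(\d+)\]")


def fabricated_ids(text: str, n_sources: int) -> list[int]:
    """Return cited ids that fall outside 1..n_sources, in order of appearance."""
    ids = [int(match.group(1)) for match in MARKER.finditer(text)]
    return [i for i in ids if not 1 <= i <= n_sources]
```

A smoke test along these lines would assert that `fabricated_ids(output, n)` is empty for every prompt in the sweep.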

Next-up (v0.2 scope TBD): full-ALCE reproducibility (ASQA / QAMPARI / ELI5), per-chunk NLI during generation, streaming refinements on API backends, and a possible citeformer-ts sibling if ecosystem demand materialises.

Tiered enforcement: where the masking runs

v0.1 framed the API/local split as “schema-tier vs logit-tier”, but as of late 2025 that’s no longer the honest line: every modern provider’s strict structured-outputs mode is real token-level constrained sampling inside their runtime, not post-hoc validation. The current honest distinction is where the masking runs — in your process, or inside the provider:

Backend	Where the masking runs	Mechanism	Notes
`HFBackend`	In-process	XGrammar `LogitsProcessor`	The flagship — you own the runtime.
`VLLMBackend`	In-process	XGrammar / llguidance via `GuidedDecodingParams`	Linux/CUDA only.
`LlamaCppBackend`	In-process	Native GBNF (`Llama(grammar=...)`)	CPU + Metal + CUDA.
`OpenAIBackend`	Provider runtime	Strict JSON schema	Token-level constrained sampling on `gpt-4o-2024-08-06+` and successors per OpenAI’s Aug 2024 announcement.
`AnthropicBackend`	Provider runtime	Native Citations API + `cache_control`	Provider enforces that every cite references a supplied document. Prompt-caching on by default — repeat-source RAG bills cache-read prices on subsequent calls.
`OpenRouterBackend`	Provider runtime (per upstream)	Strict JSON via OpenAI wire format	Routes to Anthropic / OpenAI / Google / Mistral / Groq / Fireworks / Together / Cohere. `provider.require_parameters: true` (default) refuses to land on upstreams that don’t honour strict mode — preserves the guarantee end-to-end.
`FireworksBackend`	Provider runtime	Native GBNF (`type: grammar`)	The cleanest “logit-tier on a hosted API” backend — citeformer’s `cite-id` GBNF rule is dropped in unchanged via Fireworks’s grammar mode. Same constraint that masks logits inside `HFBackend`, just running on Fireworks’s GPUs.
`TogetherBackend`	Provider runtime	Strict `json_schema`	Strict structured outputs on Together’s open-weight upstreams (Llama / Qwen / DeepSeek / …).
`GeminiBackend`	Provider runtime	`response_schema` (OpenAPI subset)	Constrained generation on Gemini 1.5+ / 2.x.
`MistralBackend`	Provider runtime	`response_format` strict JSON	`mistral-large-2411+`.
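
To make the `cite-id` GBNF rule mentioned in the table concrete, here is a rough sketch of how such a rule could be generated for a given number of sources. The exact rule citeformer ships is not shown on this page, so the shape below (a bracketed alternation over the valid ids) and the `cite_id_rule` helper are assumptions.

```python
def cite_id_rule(n_sources: int) -> str:
    """Build a GBNF rule that admits only [1] .. [n_sources]."""
    if n_sources < 1:
        raise ValueError("need at least one source")
    # Enumerate the valid ids explicitly so no other number can be derived.
    alternatives = " | ".join(f'"{i}"' for i in range(1, n_sources + 1))
    return f'cite-id ::= "[" ({alternatives}) "]"'
```

Because the ids are enumerated, a decoder constrained by this rule has no derivation for `[4]` when only three sources were supplied.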

All ten backends in the table above produce the same `GenerationResult`, so the orchestration, verify, and render layers are backend-agnostic. The choice between in-process and provider-runtime masking is mostly an operational question: do you want to host the model, or pay someone to do it? The structural guarantee (fabricated cite ids are token-impossible to emit) holds either way.
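
Backend-agnosticism of this kind usually comes from a shared abstract interface. The sketch below illustrates the idea with stdlib dataclasses; the field names on `GenerationResult`, the `generate` signature, and the `run` helper are all hypothetical and are not taken from the real `Backend` ABC.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class GenerationResult:
    # Field names here are illustrative, not the project's real schema.
    text: str
    cited_ids: list[int] = field(default_factory=list)


class Backend(ABC):
    @abstractmethod
    def generate(self, prompt: str, n_sources: int) -> GenerationResult:
        """Produce text whose cite ids are limited to 1..n_sources."""


class MockBackend(Backend):
    """Deterministic stand-in that always cites the first source."""

    def generate(self, prompt: str, n_sources: int) -> GenerationResult:
        return GenerationResult(text=f"{prompt} [1]", cited_ids=[1])


def run(backend: Backend, prompt: str, n_sources: int) -> GenerationResult:
    # Orchestration only sees the shared result type, never the backend kind.
    return backend.generate(prompt, n_sources)
```

Swapping `MockBackend` for a local or API backend would leave `run` and everything downstream of it unchanged, which is the property the paragraph above describes.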

The bibliography pipeline is unchanged regardless: references are rendered deterministically by our home-grown formatters, never by the model.
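
As an illustration of what "rendered deterministically" means, the following sketch turns a CSL-JSON item into a reference string with plain string formatting. It uses a simplified author-date layout and is not one of the six shipped formatters; `format_author` and `render_reference` are hypothetical names, and the sample item holds placeholder values only.

```python
def format_author(author: dict) -> str:
    """Render one CSL-JSON name as 'Family, G. H.'."""
    initials = " ".join(f"{part[0]}." for part in author.get("given", "").split())
    return f"{author['family']}, {initials}".rstrip(", ")


def render_reference(item: dict) -> str:
    """Render a CSL-JSON item in a simplified author-date layout."""
    authors = ", ".join(format_author(a) for a in item.get("author", []))
    year = item["issued"]["date-parts"][0][0]
    parts = [f"{authors} ({year}).", f"{item['title']}."]
    if "container-title" in item:
        parts.append(f"{item['container-title']}.")
    if "DOI" in item:
        parts.append(f"https://doi.org/{item['DOI']}")
    return " ".join(parts)


# Placeholder metadata, not a real publication.
item = {
    "author": [{"family": "Doe", "given": "Jane"}],
    "issued": {"date-parts": [[2020]]},
    "title": "An example title",
    "container-title": "Example Journal",
    "DOI": "10.0000/example",
}
print(render_reference(item))
```

The same input always yields the same string, so the bibliography never depends on what the model happened to generate.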

Token usage + cost

On API backends, `GenerationResult` carries a `usage: TokenUsage | None` field with `input_tokens`, `output_tokens`, the optional `cache_creation_input_tokens` and `cache_read_input_tokens` (Anthropic prompt caching), and `cost_credits`. OpenRouter reports a per-call cost in OpenRouter credits; 1 credit ≈ $1 USD by default, but the unit is credits, not dollars. Other providers leave `cost_credits` as `None`, and consumers price tokens themselves. Local backends leave `usage = None`: token accounting is meaningless when you control the runtime.
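
For providers that leave `cost_credits` unset, pricing tokens yourself might look like the sketch below. It mirrors the field names listed above but uses a stdlib dataclass in place of the project's pydantic model so that it runs without dependencies. `price_tokens` is a hypothetical helper, the per-million prices are caller-supplied placeholders, and cache-read discounts are ignored for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenUsage:
    input_tokens: int
    output_tokens: int
    cache_creation_input_tokens: Optional[int] = None
    cache_read_input_tokens: Optional[int] = None
    cost_credits: Optional[float] = None


def price_tokens(
    usage: Optional[TokenUsage],
    in_per_million: float,
    out_per_million: float,
) -> Optional[float]:
    """Price a call from token counts; returns None when no usage was reported."""
    if usage is None:  # local backends report no usage
        return None
    total = usage.input_tokens * in_per_million + usage.output_tokens * out_per_million
    return total / 1_000_000
```

The result is in whatever currency the supplied prices use, which keeps it separate from `cost_credits` and its credit unit.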