Architecture

citeformer is deliberately thin. The hard technical work — token masking, CSL rendering, PDF extraction, NLI — already lives in well-maintained dependencies. Our job is to compose them behind a single honest API.

Layer dependency order

CLI → orchestration (Citeformer) → verify → render → backends → grammar → core

Upper layers depend only on lower. A render module must never import from backends; a backend must never reach up into orchestration. Break this and the refactor radius explodes.
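The rule above can be checked mechanically. The following is a minimal sketch, not the project's actual tooling: the layer names are copied from the order above, while `import_allowed`, `violations`, and the `FORBIDDEN` deny-list are hypothetical helpers. The deny-list is an assumption that models the explicit render-to-backends ban, which the general "imports point downward" rule alone would permit.

```python
# Layer names copied from the dependency order above, highest first.
LAYERS = ["cli", "orchestration", "verify", "render", "backends", "grammar", "core"]
RANK = {name: i for i, name in enumerate(LAYERS)}

# Extra bans stated in the text that the general "import downward" rule
# would otherwise allow (assumption: modelled here as a deny-list).
FORBIDDEN = {("render", "backends")}


def import_allowed(importer: str, imported: str) -> bool:
    """Return True if a module in `importer` may import from `imported`."""
    if (importer, imported) in FORBIDDEN:
        return False
    # Same-layer imports are fine; otherwise the target must sit lower.
    return RANK[importer] <= RANK[imported]


def violations(edges):
    """Filter (importer, imported) pairs down to the ones that break the rule."""
    return [edge for edge in edges if not import_allowed(*edge)]


if __name__ == "__main__":
    observed = [
        ("orchestration", "verify"),
        ("backends", "orchestration"),  # reaches up: not allowed
        ("render", "backends"),         # explicitly banned in the text
        ("grammar", "core"),
    ]
    for importer, imported in violations(observed):
        print(f"layering violation: {importer} -> {imported}")
```

A check like this could run in CI against the import graph, so a layering break fails fast instead of surfacing during a refactor.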

Piggyback-first

Before writing new code, ask: is this already done by one of these?

| We piggyback on | For |
| --- | --- |
| XGrammar / llguidance | Grammar-level token masking at generation time |
| transformers (HF) | Running local causal LMs |
| vLLM | High-throughput inference with --guided-decoding-backend |
| llama.cpp (llama-cpp-python) | CPU / Apple Silicon inference with GBNF grammars |
| openai / anthropic / google-genai / mistralai | API-provider generation clients (the openai SDK is also the wire client for OpenRouter) |
| lark | Authoring the citation grammar before handing off to the decoder |
| httpx + diskcache | Metadata fetchers (Crossref, arXiv) with polite caching |
| pypdf / grobid-client-python | PDF text extraction: pypdf default, GROBID opt-in for cleaner scientific-paper parsing |
| readability-lxml | URL extraction |
| DeBERTa-v3-MNLI (via transformers) | NLI entailment for verify() |
| pydantic + typer + rich | Types, CLI, pretty output |

The parts citeformer owns are the glue plus the render layer: the citation grammar shape (§10.1), the CSL-JSON source metadata contract (§10.2), the output pydantic models (§10.3), the inline-marker-to-reference coupling, the orchestration loop, and the six hand-written CSL formatters (APA 7, MLA 9, Chicago author-date, IEEE, Nature, Vancouver — see ADR-004). Everything else is a composition.
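To make the inline-marker-to-reference coupling concrete, here is a small sketch under stated assumptions. It assumes the bracketed-number marker shape `[n]` and plain strings as sources; `couple_markers` is a hypothetical helper, not citeformer's API, and the real output models are the pydantic types described in §10.3.

```python
import re

MARKER = re.compile(r"\[(\d+)\]")  # assumed marker shape: [1], [2], ...


def couple_markers(text: str, sources: list[str]) -> list[tuple[int, str]]:
    """Map each inline [n] marker to sources[n - 1], in order of first use.

    Raises ValueError for a marker that has no matching source.
    """
    seen: dict[int, str] = {}
    for match in MARKER.finditer(text):
        n = int(match.group(1))
        if not 1 <= n <= len(sources):
            raise ValueError(f"marker [{n}] has no source (have {len(sources)})")
        # setdefault keeps the first occurrence, so repeats do not reorder.
        seen.setdefault(n, sources[n - 1])
    return list(seen.items())


if __name__ == "__main__":
    sources = ["source one", "source two"]
    text = "A claim [2], another claim [1], and a repeat [2]."
    for number, source in couple_markers(text, sources):
        print(number, source)
```

With grammar-level enforcement the `ValueError` branch should be unreachable for generated text; the check is still useful for text that arrives from elsewhere.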

Phase plan

v0.1.0 shipped on 2026-04-24. Each phase was a mergeable milestone with its own exit criterion; see the frozen genesis at docs/spec/v0.md for the original plan.

| Phase | Scope | Exit criterion |
| --- | --- | --- |
| P0 | Scaffolding: pyproject, CI, docs skeleton, .claude/ | make lint && make test && make docs-build green; v0.0.1 publishes to TestPyPI |
| P1 | Core types: Source, Citation, Reference, GenerationResult, Policy, Backend ABC | Contracts locked; mock backend works end-to-end |
| P2 | HF backend with grammar-level logit enforcement (the flagship) | Smoke test: given N sources, model cannot emit [N+k] for any k > 0, across 100+ prompts |
| P3 | Deterministic CSL reference rendering (home-grown, see ADR-004) | APA 7, MLA 9, Chicago author-date, IEEE, Nature, Vancouver render cleanly on the fixture set |
| P4 | Metadata adapters: DOI, arXiv, PDF, URL, BibTeX, Zotero | VCR-backed CI tests plus a live smoke script |
| P5 | vLLM and llama.cpp backends | All three local backends pass the same conformance suite |
| P6 | NLI verification + hand-curated AI-papers benchmark | Coverage report shows support-rate gains; see benchmarks/README.md |
| Polish | REQUIRED progression fix (ADR-009), real CLI, examples as living reports | ADR-009 integration test passes; citeformer CLI covers generate/verify/render; examples/ has runnable scripts with findings READMEs |
| Expansion | Marker-shape enum (ADR-011), OpenAI + Anthropic + Gemini + Mistral API backends, threshold calibration, multi-prompt + ALCE benchmarks, literature-review notebook, HF Space demo, GROBID PDF extractor | Seven backends pass a shared contract; 40-run multi-prompt sweep reports 0.0 ± 0.0 fabrication; PREPRINT.md describes the v0.1 design + evaluation |
| P7 (shipped) | v0.1.0 on PyPI + GitHub Release | pip install citeformer==0.1.0 works; docs built on RTD; CI green across Python 3.11–3.14 |

Next-up (v0.2 scope TBD): full-ALCE reproducibility (ASQA / QAMPARI / ELI5), per-chunk NLI during generation, streaming refinements on API backends, and a possible citeformer-ts sibling if ecosystem demand materialises.

Tiered enforcement — where the masking runs

v0.1 framed the API/local split as “schema-tier vs logit-tier”, but as of late 2025 that’s no longer the honest line: every modern provider’s strict structured-outputs mode is real token-level constrained sampling inside their runtime, not post-hoc validation. The current honest distinction is where the masking runs — in your process, or inside the provider:

| Backend | Where the masking runs | Mechanism | Notes |
| --- | --- | --- | --- |
| HFBackend | In-process | XGrammar LogitsProcessor | The flagship: you own the runtime. |
| VLLMBackend | In-process | XGrammar / llguidance via GuidedDecodingParams | Linux/CUDA only. |
| LlamaCppBackend | In-process | Native GBNF (Llama(grammar=...)) | CPU + Metal + CUDA. |
| OpenAIBackend | Provider runtime | Strict JSON schema | Token-level constrained sampling on gpt-4o-2024-08-06+ and successors per OpenAI’s Aug 2024 announcement. |
| AnthropicBackend | Provider runtime | Native Citations API + cache_control | Provider enforces that every cite references a supplied document. Prompt-caching on by default; repeat-source RAG bills cache-read prices on subsequent calls. |
| OpenRouterBackend | Provider runtime (per upstream) | Strict JSON via OpenAI wire format | Routes to Anthropic / OpenAI / Google / Mistral / Groq / Fireworks / Together / Cohere. provider.require_parameters: true (default) refuses to land on upstreams that don’t honour strict mode, which preserves the guarantee end-to-end. |
| FireworksBackend | Provider runtime | Native GBNF (type: grammar) | The cleanest “logit-tier on a hosted API” backend: citeformer’s cite-id GBNF rule is dropped in unchanged via Fireworks’s grammar mode. Same constraint that masks logits inside HFBackend, just running on Fireworks’s GPUs. |
| TogetherBackend | Provider runtime | Strict json_schema | Strict structured outputs on Together’s open-weight upstreams (Llama / Qwen / DeepSeek / …). |
| GeminiBackend | Provider runtime | response_schema (OpenAPI subset) | Constrained generation on Gemini 1.5+ / 2.x. |
| MistralBackend | Provider runtime | response_format strict JSON | mistral-large-2411+. |

All ten backends produce the same GenerationResult; the orchestration, verify, and render layers are backend-agnostic. The choice between in-process and provider-runtime masking is mostly an operational question: do you want to host the model, or pay someone to do it? The structural guarantee, that fabricated cite ids are token-impossible to emit, holds either way.
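As an illustration of why the guarantee is structural, the sketch below builds the same "only ids 1..N" constraint in two shapes: a GBNF fragment (the form used by grammar-mode backends) and a JSON schema with an `enum` (the form a strict structured-outputs mode would accept). Both helpers and the `text` / `citations` field names are hypothetical; citeformer's real grammar shape is defined in §10.1, and a complete GBNF grammar would also need a `root` rule.

```python
import json


def cite_id_gbnf(n_sources: int) -> str:
    """Build a GBNF fragment whose only legal cite ids are 1..n_sources."""
    if n_sources < 1:
        raise ValueError("need at least one source")
    ids = " | ".join(f'"{i}"' for i in range(1, n_sources + 1))
    return "\n".join([
        'cite ::= "[" cite-id "]"',
        f"cite-id ::= {ids}",
    ])


def cite_id_json_schema(n_sources: int) -> dict:
    """Build a JSON schema whose citation ids are restricted to 1..n_sources."""
    if n_sources < 1:
        raise ValueError("need at least one source")
    return {
        "type": "object",
        "properties": {
            "text": {"type": "string"},  # hypothetical field name
            "citations": {               # hypothetical field name
                "type": "array",
                "items": {"type": "integer", "enum": list(range(1, n_sources + 1))},
            },
        },
        "required": ["text", "citations"],
        "additionalProperties": False,
    }


if __name__ == "__main__":
    print(cite_id_gbnf(3))
    print(json.dumps(cite_id_json_schema(3), indent=2))
```

In both shapes an id above N is simply absent from the set of legal continuations, so a constrained decoder has no path that emits it.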

The bibliography pipeline is unchanged regardless: references are rendered deterministically by our home-grown formatters, never by the model.
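A toy example of what "rendered deterministically" means in practice: a pure function from CSL-JSON metadata to a string. This is a deliberately simplified author-date shape, not a faithful version of any of the six supported styles, and `render_reference` is a hypothetical name. The input keys (`author`, `issued` / `date-parts`, `title`, `container-title`) are standard CSL-JSON fields; the sample item is made-up placeholder data.

```python
def render_reference(item: dict) -> str:
    """Render one CSL-JSON item as 'Family, G. (Year). Title. Container.'"""
    authors = ", ".join(
        f"{a['family']}, {a['given'][0]}." for a in item.get("author", [])
    )
    year = item["issued"]["date-parts"][0][0]
    parts = [f"{authors} ({year}).", f"{item['title']}."]
    if "container-title" in item:
        parts.append(f"{item['container-title']}.")
    return " ".join(parts)


if __name__ == "__main__":
    # Placeholder metadata, not a real publication.
    item = {
        "author": [
            {"family": "Doe", "given": "Jane"},
            {"family": "Roe", "given": "Richard"},
        ],
        "issued": {"date-parts": [[2020]]},
        "title": "An Example Title",
        "container-title": "Journal of Examples",
    }
    print(render_reference(item))
    # Doe, J., Roe, R. (2020). An Example Title. Journal of Examples.
```

Because the function only reads the metadata dict, the same source always yields the same reference string, whatever the model wrote in the body text.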

Token usage + cost

On API backends, GenerationResult carries a usage: TokenUsage | None field with input_tokens, output_tokens, the optional cache_creation_input_tokens / cache_read_input_tokens (Anthropic prompt-caching), and cost_credits. OpenRouter exposes a per-call cost in OR credits; 1 credit ≈ $1 USD by default, but the unit is credits, not dollars. Other providers leave cost_credits as None, and consumers price tokens themselves. Local backends leave usage = None, because token accounting is meaningless when you control the runtime.
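For consumers who price tokens themselves, the sketch below shows one way that might look. The field names follow the description above, but the class is modelled as a stdlib dataclass for self-containment (the real type is presumably a pydantic model), and `estimate_cost_usd` with its per-million-token price parameters is a hypothetical helper. It ignores the cache token fields, which a real price calculation would need to handle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TokenUsage:
    """Stand-in for the usage record described above (assumed shape)."""
    input_tokens: int
    output_tokens: int
    cache_creation_input_tokens: Optional[int] = None
    cache_read_input_tokens: Optional[int] = None
    cost_credits: Optional[float] = None


def estimate_cost_usd(
    usage: Optional[TokenUsage],
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
) -> Optional[float]:
    """Price a call from token counts; returns None when usage is absent.

    Prices are per million tokens and supplied by the caller. Cache
    creation / read tokens are ignored in this simplified sketch.
    """
    if usage is None:  # local backends report no usage
        return None
    total = (
        usage.input_tokens * usd_per_mtok_in
        + usage.output_tokens * usd_per_mtok_out
    )
    return total / 1_000_000


if __name__ == "__main__":
    usage = TokenUsage(input_tokens=1_000_000, output_tokens=500_000)
    # Illustrative prices only, not any provider's actual rates.
    print(estimate_cost_usd(usage, usd_per_mtok_in=2.0, usd_per_mtok_out=10.0))
```

When cost_credits is populated, it is likely simpler to report that value as credits than to re-derive a dollar figure from token counts.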