Guarantees
What “bulletproof” actually means in citeformer.
What’s enforced, where
| Property | Local backends (HF, vLLM, llama.cpp) | API backends (OpenAI, Gemini, Mistral) | Provider-native (Anthropic) |
|---|---|---|---|
| Citation marker cannot refer to a non-existent source | Logit-layer: … | Schema-layer: … | Provider-native: the Citations API returns per-block … |
| Reference list always renders deterministically | Yes: never touches the LLM | Same | Same |
| Every inline marker has a matching reference, and vice versa | Yes: coupled at render time | Same | Same |
| Claim is actually supported by the cited source | Verified: NLI entailment via … | Same | Same |
| Sentence without a citation is actually non-factual | Flagged: NLI coverage check via … | Same | Same |
| Format matches the requested CSL style exactly | Yes: six hand-written formatters (APA 7, MLA 9, Chicago author-date, IEEE, Nature, Vancouver) | Same | Same |
All seven backends produce the same GenerationResult, so verify, render, and streaming work identically across tiers.
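The marker/reference coupling row can be illustrated with a small standalone check. This is only a sketch, assuming numeric bracket markers such as [1]; the function name check_coupling and the regular expression are hypothetical and are not part of citeformer's API.

```python
import re

# Assumption: inline markers look like "[1]", "[2]", ... (numeric ids in brackets).
MARKER = re.compile(r"\[(\d+)\]")


def check_coupling(text: str, reference_ids: set[int]) -> tuple[set[int], set[int]]:
    """Return (markers with no matching reference, references never cited)."""
    cited = {int(m) for m in MARKER.findall(text)}
    dangling = cited - reference_ids   # marker points at nothing
    unused = reference_ids - cited     # reference rendered but never cited
    return dangling, unused


# Hypothetical example text; both sets are empty when the coupling holds.
dangling, unused = check_coupling("Water boils at 100 C [1]. Ice floats [2].", {1, 2})
```

When both returned sets are empty, every inline marker has a reference and every reference is cited at least once.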
What’s not enforced
Claim truth. citeformer enforces that if a claim is cited, the citation points at a real source and, via verify(), that the source entails the claim. It does not verify that the source itself is correct. Garbage in, cited garbage out.
Policy appropriateness. You pick a citation_policy (required, quotes_only, auto). citeformer enforces the grammar that policy implies; it doesn’t decide for you whether that policy is right for your domain.
Retrieval quality. citeformer is downstream of your retriever. If you retrieved irrelevant chunks, the model has to cite them anyway (or hit the coverage-flag branch of verify()).
Natural sentence length under REQUIRED. The REQUIRED policy bounds per-sentence content at 240 characters (configurable) to guarantee progression on small models; see ADR-009 for the structural fix to the ADR-007 stall. Sentences that would naturally run longer get clipped mid-clause, with the citation landing at the clip point. Tune max_content_chars higher for very long-sentence technical writing, or pass None to disable bounding entirely.
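To make the clipping behaviour concrete, here is a post-hoc illustration of its effect. It is a sketch under stated assumptions: the bounding described above happens during generation, whereas this hypothetical bound_sentence helper simply truncates a finished string. Only the max_content_chars name, the 240 default, and the None option come from the text above.

```python
from typing import Optional


def bound_sentence(content: str, cite_id: int,
                   max_content_chars: Optional[int] = 240) -> str:
    """Clip content to max_content_chars, then place the citation at the clip point.

    Hypothetical helper for illustration only; not citeformer's implementation.
    """
    if max_content_chars is not None and len(content) > max_content_chars:
        # The cut can land mid-clause, which is the trade-off described above.
        content = content[:max_content_chars].rstrip()
    return f"{content} [{cite_id}]"


long_sentence = "x" * 300
clipped = bound_sentence(long_sentence, 3)            # content cut at 240 chars
unbounded = bound_sentence(long_sentence, 3, None)    # None disables bounding
```

Raising max_content_chars moves the clip point later; passing None leaves the sentence intact.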
Tier semantics: why we frame it honestly
Calling schema-layer enforcement “bulletproof” is defensible; calling it identical to logit-layer isn’t. The difference:
Logit tier (local backends): the sampler never sees an out-of-scope token. Fabrication is impossible in the generation process itself.
Schema tier (OpenAI, Gemini, Mistral): the provider’s validator rejects non-conforming responses server-side. Fabrication is impossible in the returned payload — but the provider’s own inner sampler is opaque to us.
Provider-native (Anthropic Citations API): the provider’s own citation system is the guarantee. We adapt their shape into GenerationResult; we don’t add enforcement on top.
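The logit-tier claim, that the sampler never sees an out-of-scope token, corresponds to the general constrained-decoding technique of masking logits before sampling. The sketch below shows that generic technique in plain Python; it is not citeformer's implementation, and the function names and toy vocabulary are assumptions made for illustration.

```python
import math


def mask_out_of_scope(logits: list[float], allowed_token_ids: set[int]) -> list[float]:
    """Set every disallowed token's logit to -inf so it can never be sampled."""
    if not allowed_token_ids:
        raise ValueError("at least one token must remain allowed")
    return [x if i in allowed_token_ids else -math.inf
            for i, x in enumerate(logits)]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax; -inf logits map to probability 0.0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of five token ids. Token 4 stands in for an out-of-scope
# citation id that the unconstrained model strongly prefers.
raw_logits = [1.0, 0.5, 2.0, 0.1, 9.0]
probs = softmax(mask_out_of_scope(raw_logits, allowed_token_ids={0, 1, 2, 3}))
```

After masking, the out-of-scope token carries exactly zero probability no matter how high its raw logit was, which is why fabrication cannot occur inside the generation process on this tier.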
In practice the downstream consequence is identical: your code receives a GenerationResult whose cite ids are all in-scope. The framing difference matters only if you’re writing a threat model — for most users, all three tiers deliver “no fabricated citations, ever.”
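For the schema tier, one common way to keep cite ids in scope is a JSON Schema whose enum lists only the valid ids, so the provider's validator rejects anything else. The following is a sketch of that idea only: the field names (sentences, text, cite_ids) and the citation_schema helper are hypothetical, and the schema citeformer actually sends is not shown here.

```python
def citation_schema(n_sources: int) -> dict:
    """Build a JSON Schema in which cite ids may only be 1..n_sources.

    Hypothetical payload shape, for illustration only.
    """
    in_scope_ids = list(range(1, n_sources + 1))
    return {
        "type": "object",
        "properties": {
            "sentences": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "text": {"type": "string"},
                        "cite_ids": {
                            "type": "array",
                            # The enum is what rules out fabricated ids.
                            "items": {"type": "integer", "enum": in_scope_ids},
                        },
                    },
                    "required": ["text", "cite_ids"],
                    "additionalProperties": False,
                },
            }
        },
        "required": ["sentences"],
        "additionalProperties": False,
    }


schema = citation_schema(6)
allowed = schema["properties"]["sentences"]["items"]["properties"]["cite_ids"]["items"]["enum"]
```

With six sources in scope, a response containing cite id 7 would not conform to this schema.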
Live evidence
Adversarial: prompt explicitly demands [7] and [8] when only 6 sources are in scope. Baseline complies (100% fab); citeformer structurally can’t (0%).
40-run multi-prompt sweep: 0.0 ± 0.0 fabrication across every (prompt, model, seed) cell.
OpenAI + Anthropic live-API tests: env-gated integration suite that hits production endpoints and verifies the structural invariant holds end-to-end.
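A fabrication rate like the ones quoted above could be computed along these lines. This is a sketch that assumes fabrication means "the fraction of numeric markers whose id falls outside the in-scope range"; the evaluation's actual definition is not given here, and the sample strings are invented inputs, not real model outputs.

```python
import re

# Assumption: markers are numeric ids in square brackets, as in [7] and [8].
MARKER = re.compile(r"\[(\d+)\]")


def fabrication_rate(text: str, n_sources: int) -> float:
    """Fraction of citation markers whose id is outside 1..n_sources."""
    ids = [int(m) for m in MARKER.findall(text)]
    if not ids:
        return 0.0  # no markers, so nothing fabricated
    fabricated = sum(1 for i in ids if not 1 <= i <= n_sources)
    return fabricated / len(ids)


# Invented outputs for a prompt that demands [7] and [8] with 6 sources in scope.
complying_output = "First claim [7]. Second claim [8]."
constrained_output = "First claim [1]. Second claim [6]."
rates = (fabrication_rate(complying_output, 6), fabrication_rate(constrained_output, 6))
```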
Further reading
Architecture — the 6-layer design, the piggyback map, and the tiered enforcement section.
Contracts — the three §10 invariants that govern versioning.
PREPRINT — longer design + evaluation write-up.