citeformer.grammar.builder
Citation-grammar construction — §10.1 contract implementation.
Emits GBNF grammars consumable by XGrammar / llguidance / llama.cpp. The point
of this module is the cite-id rule:
cite-id ::= OPEN ("1" | "2" | ... | "N") CLOSE
where OPEN / CLOSE are the delimiters for the chosen citeformer.core.MarkerStyle (default […]), and N is dynamically set to len(sources) per generate() call. That's what makes a fabricated citation ([N+k] for any k > 0) a logit-level impossibility when the downstream backend masks against this grammar, regardless of which marker shape is chosen.
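As a minimal sketch of the rule above (not the module's actual implementation), the id enumeration can be rendered with plain string formatting. The helper name render_cite_id_rule and its parameters are hypothetical; only the shape of the rule comes from the description above.

```python
def render_cite_id_rule(n_sources: int, open_char: str = "[", close_char: str = "]") -> str:
    """Render a cite-id rule that admits exactly the ids 1..n_sources.

    Hypothetical helper for illustration; not part of citeformer's API.
    """
    if n_sources < 1:
        raise ValueError("n_sources must be >= 1")
    # Each id is its own string literal, so "12" is a single alternative
    # rather than a free digit sequence that could spell an unknown id.
    ids = " | ".join(f'"{i}"' for i in range(1, n_sources + 1))
    rule = f'cite-id ::= "{open_char}" ({ids})'
    # Open-ended markers such as ^N have no closing delimiter.
    if close_char:
        rule += f' "{close_char}"'
    return rule


print(render_cite_id_rule(3))
# prints: cite-id ::= "[" ("1" | "2" | "3") "]"
```

Because the alternatives are enumerated per call, an id above n_sources simply has no production in the grammar, which is the property the paragraph above relies on.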
Three policies sit on top:
- REQUIRED: every sentence must end with the sequence content cite-group sent-end. The model can't close a sentence without citing. content is bounded to max_content_chars (default 240) so small models can't stall in content state indefinitely; see docs/decisions/009-bounded-content-required.md.
- QUOTES_ONLY: only quoted spans require a trailing cite-group. Narrative prose can stand alone.
- AUTO: cite-group is allowed anywhere but not required. The verify() coverage check surfaces missing citations post-hoc instead.
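The bounded-content idea in the REQUIRED policy can be sketched as a choice of repetition operator. This is an illustrative fragment only: the rule names sentence and content-char are hypothetical, the cite-group, sent-end, and content-char rules are left undefined, and it assumes the target GBNF dialect accepts {m,n} repetition (llama.cpp's GBNF documents this form).

```python
from typing import Optional


def content_repetition(max_content_chars: Optional[int]) -> str:
    """Return the repetition suffix for the content rule.

    None keeps the unbounded "+"; an integer becomes a {1,N} bound.
    """
    if max_content_chars is None:
        return "+"
    if max_content_chars < 1:
        raise ValueError("max_content_chars must be >= 1 (use None for unbounded)")
    return f"{{1,{max_content_chars}}}"


def render_required_rules(max_content_chars: Optional[int] = 240) -> str:
    """Render a partial, hypothetical REQUIRED-policy grammar body."""
    rep = content_repetition(max_content_chars)
    return "\n".join(
        [
            "root ::= sentence+",
            # Every sentence must pass through cite-group before sent-end.
            "sentence ::= content cite-group sent-end",
            # Once the bound is exhausted, only cite-group can follow.
            f"content ::= content-char{rep}",
        ]
    )


print(render_required_rules(240))
```

With the bound in place, a model that keeps emitting content characters runs out of legal continuations after 240 of them and has to start a citation.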
Format note: we emit GBNF (the GGML grammar format used by llama.cpp and
xgrammar) rather than Lark because xgrammar’s parser expects ::= not :.
Semantically equivalent; just a syntax swap. Semantic validity is exercised at
integration time — the HF backend’s test_hf_backend_grammar_compiles compiles
the emitted string with xgrammar, which is the authoritative parser.
Module Contents
Classes
| MarkerSpec | Delimiter configuration for a MarkerStyle. |
| Grammar | A citation-constraining GBNF grammar for one generation call. |
Functions
| build_grammar | Build the citation-constraining GBNF grammar for a generation call. |
Data
| DEFAULT_MAX_CONTENT_CHARS | |
| MARKER_SPECS | |
API
- citeformer.grammar.builder.DEFAULT_MAX_CONTENT_CHARS
240
- class citeformer.grammar.builder.MarkerSpec
Delimiter configuration for a MarkerStyle.
Attributes:
    open_char: Single character that opens a marker (e.g. [ / ( / { / ^). Excluded from the grammar's text / content character classes so the parser knows when a marker starts.
    close_char: Single character that closes a marker, or empty string for open-ended markers like ^N. Not part of the exclusion set because it can appear in regular prose.
- citeformer.grammar.builder.MARKER_SPECS: dict[citeformer.core.MarkerStyle, citeformer.grammar.builder.MarkerSpec]
None
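A minimal sketch of how a MarkerSpec table might be laid out, assuming a frozen dataclass and a stand-in enum. The MarkerStyle class here is a local placeholder for citeformer.core.MarkerStyle, and the delimiter pairs for PAREN and CURLY are inferred from the style names rather than taken from the source; format_marker is a hypothetical helper.

```python
from dataclasses import dataclass
from enum import Enum


class MarkerStyle(Enum):
    """Stand-in for citeformer.core.MarkerStyle; member values are assumed."""

    BRACKET = "bracket"
    PAREN = "paren"
    CURLY = "curly"
    CARET = "caret"


@dataclass(frozen=True)
class MarkerSpec:
    open_char: str   # single character that opens a marker
    close_char: str  # closing character, or "" for open-ended markers


# Assumed delimiter pairs; only [N] and ^N are spelled out in the docs.
MARKER_SPECS = {
    MarkerStyle.BRACKET: MarkerSpec("[", "]"),
    MarkerStyle.PAREN: MarkerSpec("(", ")"),
    MarkerStyle.CURLY: MarkerSpec("{", "}"),
    MarkerStyle.CARET: MarkerSpec("^", ""),
}


def format_marker(style: MarkerStyle, cite_id: int) -> str:
    """Render one inline marker for the given style (hypothetical helper)."""
    spec = MARKER_SPECS[style]
    return f"{spec.open_char}{cite_id}{spec.close_char}"


print(format_marker(MarkerStyle.BRACKET, 1))  # prints: [1]
print(format_marker(MarkerStyle.CARET, 2))    # prints: ^2
```

Keeping the spec immutable means a style's delimiters cannot drift between the grammar that constrains generation and whatever later parses the markers.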
- class citeformer.grammar.builder.Grammar
A citation-constraining GBNF grammar for one generation call.
Attributes:
    gbnf: Full GBNF grammar string. Accepted by XGrammar's compile_grammar() and by llama.cpp's native GBNF support.
    cite_ids: 1-indexed source ids that the grammar admits, in ascending order. Derived from len(sources) at build time.
    policy: Enforcement policy that shaped the grammar body.
    marker_style: Delimiter shape used by the cite-id terminal. Defaults to MarkerStyle.BRACKET to match §10.1's canonical [N] shape.
    root_rule: The entry rule name. Always "root", following GBNF convention; this is also xgrammar's default, so no explicit root_rule_name override is needed.
    max_content_chars: Upper bound on content repetition for the REQUIRED policy. None means unbounded (legacy +). For AUTO and QUOTES_ONLY this field is None because the bound only applies to REQUIRED.
- policy: citeformer.core.Policy
None
- marker_style: citeformer.core.MarkerStyle
None
- citeformer.grammar.builder.build_grammar(n_sources: int, policy: citeformer.core.Policy, *, max_content_chars: int | None = DEFAULT_MAX_CONTENT_CHARS, marker_style: citeformer.core.MarkerStyle = MarkerStyle.BRACKET) -> citeformer.grammar.builder.Grammar
Build the citation-constraining GBNF grammar for a generation call.
Args:
    n_sources: Number of sources in scope. Must be >= 1. Determines the set of valid cite ids (1..n_sources inclusive).
    policy: Citation enforcement policy.
    max_content_chars: Soft progression bound for the REQUIRED policy. After this many characters of content since the last sentence terminator, the grammar forces the model into a citation, closing the ADR-007 stall loophole. Set None to disable bounding (legacy behavior; risks stall on small models). Ignored for AUTO and QUOTES_ONLY policies, which have no sentence-level shape to bound. See docs/decisions/009-bounded-content-required.md.
    marker_style: Visual shape for inline markers. Defaults to MarkerStyle.BRACKET ([N], §10.1's canonical form). Swap to PAREN / CURLY / CARET when you need the marker to not clash with downstream syntax (e.g. Markdown link syntax reserves [ / ]). The digit-enum structural guarantee is identical across styles.
Returns:
    A Grammar with the rendered GBNF and the metadata backends need.
Raises:
    ValueError: If n_sources < 1, or if max_content_chars is < 1 (use None for unbounded).
    NotImplementedError: If policy is not one of the Policy enum values (e.g. a future variant that a user might have hand-cast).
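The argument contract above can be sketched as a standalone check. This is an illustration under stated assumptions, not the library's code: Policy is a local stand-in for citeformer.core.Policy, check_build_args is a hypothetical name, and the order of the checks (including whether max_content_chars is validated for policies that ignore it) is not specified by the documentation.

```python
from enum import Enum
from typing import Optional, Tuple


class Policy(Enum):
    """Stand-in for citeformer.core.Policy; member values are assumed."""

    REQUIRED = "required"
    QUOTES_ONLY = "quotes_only"
    AUTO = "auto"


def check_build_args(
    n_sources: int,
    policy: Policy,
    max_content_chars: Optional[int] = 240,
) -> Tuple[Tuple[int, ...], Optional[int]]:
    """Validate arguments; return (cite_ids, effective max_content_chars)."""
    if n_sources < 1:
        raise ValueError("n_sources must be >= 1")
    # Assumption: the bound is validated even when the policy ignores it.
    if max_content_chars is not None and max_content_chars < 1:
        raise ValueError("max_content_chars must be >= 1 (use None for unbounded)")
    if not isinstance(policy, Policy):
        raise NotImplementedError(f"unsupported policy: {policy!r}")
    # Valid ids are 1..n_sources inclusive, in ascending order.
    cite_ids = tuple(range(1, n_sources + 1))
    # The bound only applies to REQUIRED; other policies record None.
    effective = max_content_chars if policy is Policy.REQUIRED else None
    return cite_ids, effective


print(check_build_args(3, Policy.REQUIRED))  # prints: ((1, 2, 3), 240)
print(check_build_args(2, Policy.AUTO))      # prints: ((1, 2), None)
```

Going by the signature above, a real call would presumably look like build_grammar(3, Policy.REQUIRED, marker_style=MarkerStyle.PAREN), with the keyword-only arguments left at their defaults when not needed.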