`citeformer.core`¶

Core types for citeformer.

Contains the §10.2 (Source.metadata CSL-JSON shape) and §10.3 (GenerationResult output schema) contracts — both are pinned by snapshot tests in tests/integration/test_schemas.py. Touching any of these models requires the ceremony documented in docs/reference/contracts.md.

Module Contents¶

Classes¶

`Policy`	Citation enforcement policy.
`MarkerStyle`	Inline citation-marker visual shape.
`Source`	A piece of evidence made available to the model.
`TokenUsage`	Token-level cost accounting for one `Backend.generate()` call.
`Citation`	A single inline citation marker emitted by the model.
`Reference`	A rendered bibliography entry paired with its inline marker.
`GenerationResult`	Full output of a `Citeformer.generate()` call.

API¶

class citeformer.core.Policy¶

Bases: enum.StrEnum

Citation enforcement policy.

REQUIRED: every sentence must end with at least one citation (strictest; default). The grammar bounds per-sentence content at DEFAULT_MAX_CONTENT_CHARS to guarantee progression even on small models — see docs/decisions/009-bounded-content-required.md.
QUOTES_ONLY: only quoted spans require a citation; narrative sentences can stand alone.
AUTO: citations are optional at every position; verify() surfaces missing citations via the coverage check instead of rejecting them at decode time.

REQUIRED¶: 'required'

QUOTES_ONLY¶: 'quotes_only'

AUTO¶: 'auto'
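
Because the policy members are string-valued, a policy name read from a config file or CLI flag can be turned into a member by value lookup. The sketch below uses a stand-in enum that mirrors the documented values, not the library class itself; it mixes in `str` so that it also runs on Python versions older than 3.11, and `parse_policy` is a hypothetical helper.

```python
from enum import Enum


class Policy(str, Enum):
    """Stand-in mirroring the documented member values (not the library class)."""

    REQUIRED = "required"
    QUOTES_ONLY = "quotes_only"
    AUTO = "auto"


def parse_policy(raw: str) -> Policy:
    """Hypothetical helper: map a config string to a member by value lookup.

    Raises ValueError for names that are not one of the documented values.
    """
    return Policy(raw.strip().lower())


policy = parse_policy("Quotes_Only")
print(policy is Policy.QUOTES_ONLY)  # value lookup returns the singleton member
print(policy == "quotes_only")       # str-based members compare equal to their value
```

Unknown names fail loudly with `ValueError`, which is usually preferable to silently falling back to a default policy.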

class citeformer.core.MarkerStyle¶

Bases: enum.StrEnum

Inline citation-marker visual shape.

Orthogonal to `Policy`: the structural guarantee (“digit enum is bounded by len(sources)”) holds for every marker style because the grammar enumerates the same set of ids regardless of which delimiters bracket them.

BRACKET (default): [1] — numeric styles, IEEE / Vancouver shape.
PAREN: (1) — used by some author-year styles and legacy newspaper conventions.
CURLY: {1} — less common but useful when the downstream pipeline already reserves square brackets (e.g. Markdown link syntax).
CARET: ^1 — caret-prefixed numeric, a footnote-style inline without the superscript Unicode.

Picking a non-bracket marker does not change the §10.1 structural guarantee — it just changes the delimiters used at both the grammar terminal and the post-hoc parse regex.

BRACKET¶: 'bracket'

PAREN¶: 'paren'

CURLY¶: 'curly'

CARET¶: 'caret'
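
To illustrate why the marker style only changes delimiters, here is a minimal sketch of a post-hoc parse that swaps the regex per style while keeping the same bounds check on the id. The patterns and the `find_markers` helper are assumptions for illustration; the library's own parse regexes may differ.

```python
import re

# Hypothetical patterns, one per documented marker shape.
MARKER_PATTERNS = {
    "bracket": re.compile(r"\[(\d+)\]"),
    "paren": re.compile(r"\((\d+)\)"),
    "curly": re.compile(r"\{(\d+)\}"),
    "caret": re.compile(r"\^(\d+)"),
}


def find_markers(text: str, style: str, n_sources: int) -> list[tuple[tuple[int, int], int]]:
    """Return ((start, end), source_id) pairs for every marker in `text`.

    Ids outside 1..n_sources are rejected, mirroring the bounded-id guarantee.
    """
    found = []
    for match in MARKER_PATTERNS[style].finditer(text):
        source_id = int(match.group(1))
        if not 1 <= source_id <= n_sources:
            raise ValueError(f"marker {match.group(0)!r} is out of range")
        found.append((match.span(), source_id))
    return found


print(find_markers("Ravens are corvids {1}. They mimic speech {2}.", "curly", 2))
```

Only the dictionary entry differs between styles; the id extraction and range check are shared, which is the sense in which the style is orthogonal to the structural guarantee.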

class citeformer.core.Source(/, **data: Any)¶

Bases: pydantic.BaseModel

A piece of evidence made available to the model.

Position in the sources list passed to Citeformer.generate() determines the citation index used by the model and echoed back in Citation.source_id and Reference.source_id — it is always 1-indexed.

§10.2 contract: metadata must be a CSL-JSON item — the shape our home-grown formatters (and, historically, citeproc-py) consume. See https://github.com/citation-style-language/schema for the spec.

Attributes:
metadata: CSL-JSON item with at least id, type, and whatever fields the selected CSL style needs to render the entry (author, title, issued, container-title, DOI, URL, …).
content: Raw chunk text the model may cite from. Passed into the prompt; also used by verify() for NLI entailment.

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config¶: ‘ConfigDict(…)’

metadata: dict[str, Any]¶: ‘Field(…)’

content: str¶: ‘Field(…)’
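
As a rough illustration of the two fields and of the 1-indexed citation numbering, the sketch below uses plain dicts as stand-ins for `Source`. The CSL-JSON keys are standard; the second item, the chunk texts, and the `missing_core_fields` helper are placeholders invented for the example.

```python
# Plain dicts stand in for Source; field names follow the documented model.
raven = {
    "id": "poe1845",
    "type": "book",
    "title": "The Raven",
    "author": [{"family": "Poe", "given": "Edgar Allan"}],
    "issued": {"date-parts": [[1845]]},
}
placeholder = {"id": "example-page", "type": "webpage", "title": "Example page"}

sources = [
    {"metadata": raven, "content": "Placeholder chunk text for the first source."},
    {"metadata": placeholder, "content": "Placeholder chunk text for the second source."},
]


def missing_core_fields(metadata: dict) -> list[str]:
    """Hypothetical check: report which of the two always-required keys are absent."""
    return [key for key in ("id", "type") if key not in metadata]


# Citation indices are list positions counted from 1, not from 0.
citation_index = {
    src["metadata"]["id"]: position for position, src in enumerate(sources, start=1)
}
print(citation_index)
print(missing_core_fields({"id": "no-type-yet"}))
```

Reordering the list changes which number each source is cited under, so the order should be fixed before calling generate().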

classmethod from_doi(doi: str, **kwargs: Any) → Self¶

Build a Source from a Crossref DOI lookup.

The returned content field is empty — DOI metadata alone doesn’t ship the paper text. If you have the PDF, use Source.from_pdf to get the text and merge with metadata=source.metadata | pdf_meta, or construct the combined Source directly.

Args:
doi: DOI in bare, URL, or doi: form.
**kwargs: Forwarded to citeformer.metadata.fetch_crossref (timeout, use_cache).

Returns: A Source with metadata = CSL-JSON from Crossref and empty content.
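
When merging Crossref metadata with metadata pulled from a PDF, the operand order of the dict union matters: on a key collision the right-hand side wins. The sketch below shows both orders with made-up placeholder values (the DOI and titles are not real records).

```python
# Placeholder metadata; the DOI and titles are invented for illustration.
crossref_meta = {
    "id": "doi-item",
    "type": "article-journal",
    "title": "Title as recorded by Crossref",
    "DOI": "10.0000/example",
}
pdf_meta = {
    "title": "title scraped from the PDF info dict",
    "abstract": "Placeholder abstract text.",
}

# Right operand wins on conflicts (dict union, Python 3.9+).
pdf_wins = crossref_meta | pdf_meta
crossref_wins = pdf_meta | crossref_meta

print(pdf_wins["title"])
print(crossref_wins["title"])
print(sorted(crossref_wins))
```

If the Crossref record is the more trustworthy source for shared keys such as the title, putting it on the right keeps its values while still picking up PDF-only keys such as the abstract.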

classmethod from_arxiv(arxiv_id: str, **kwargs: Any) → Self¶

Build a Source from an arXiv API lookup.

The abstract becomes content so the model has something concrete to cite. For the full paper body, fetch the PDF and use Source.from_pdf separately.

Args:
arxiv_id: arXiv id (bare, URL, or arxiv: form; version suffix is stripped).
**kwargs: Forwarded to citeformer.metadata.fetch_arxiv.

Returns: A Source with the arXiv CSL-JSON and the abstract in content.
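
The accepted id forms can be pictured with a small normaliser. This is a sketch under assumptions, not the library's parsing code: `normalize_arxiv_id` is a hypothetical helper, and it only handles the three documented input shapes plus a trailing version suffix.

```python
import re


def normalize_arxiv_id(raw: str) -> str:
    """Hypothetical helper: reduce bare, URL, or arxiv: forms to a bare id."""
    value = raw.strip()
    # Drop an abs/pdf URL prefix if present.
    value = re.sub(r"^https?://arxiv\.org/(abs|pdf)/", "", value, flags=re.IGNORECASE)
    # Drop an "arxiv:" scheme-style prefix.
    value = re.sub(r"^arxiv:", "", value, flags=re.IGNORECASE)
    # Drop a trailing ".pdf" left over from PDF links.
    value = re.sub(r"\.pdf$", "", value)
    # Strip the version suffix (v1, v2, ...).
    return re.sub(r"v\d+$", "", value)


for example in ("1706.03762", "arXiv:1706.03762v5", "https://arxiv.org/abs/1706.03762v7"):
    print(normalize_arxiv_id(example))
```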

classmethod from_pdf(path: str | Any, **kwargs: Any) → Self¶

Build a Source from a local PDF.

Args:
path: Filesystem path to the PDF.
**kwargs: Forwarded to citeformer.metadata.extract_pdf. The important ones:

    - `extractor` (`"pypdf"` | `"grobid"`, default `"pypdf"`). `"grobid"` requires `pip install citeformer[grobid]` plus a running GROBID server (typical dev setup: `docker run -p 8070:8070 grobid/grobid:0.8.0`).
    - `grobid_url` (default `http://localhost:8070`) when using the GROBID extractor.

Returns: A Source with best-effort CSL metadata. pypdf pulls title/author/issued from the PDF info dict when set; GROBID additionally returns clean author lists (family/given), an abstract field, and section-level body text.

classmethod from_url(url: str, **kwargs: Any) → Self¶

Build a Source from an HTTP(S) URL.

Uses readability-lxml for the article body and meta-tag parsing (OpenGraph / Twitter / article) for title / author / date / site.

Args:
url: HTTP(S) URL.
**kwargs: Forwarded to citeformer.metadata.extract_url.

Returns: A Source with webpage CSL metadata and the article body in content.

classmethod from_bibtex(source: str | Any, **kwargs: Any) → list[Self]¶

Build Source instances from a BibTeX file or string.

Each BibTeX entry becomes one Source. content is left empty — BibTeX is bibliographic metadata only. Users who need chunk text should either extend the returned items after load (e.g. pair with PDF fetches for the same DOI) or use Source.from_doi for per-entry DOI lookups.

Args:
source: Filesystem path to a .bib file or a BibTeX string.
**kwargs: Reserved for future options (none currently).

Returns: A list of Source objects in document order.

classmethod from_zotero(source: str | Any, **kwargs: Any) → list[Self]¶

Build Source instances from a Zotero “Export → CSL JSON” file.

The CSL JSON export is the shape we consume natively; this classmethod is sugar for [Source(metadata=item, content="") for item in load_zotero_csl(path)]. Also supports the Better BibTeX CSL-JSON export format (identical schema).

Args:
source: Filesystem path to a .json export, raw JSON string, or an iterable of items.
**kwargs: Forwarded to `citeformer.metadata.load_zotero_csl` (filter_fn, dedupe).

Returns: A list of Source objects in the export’s order.

class citeformer.core.TokenUsage(/, **data: Any)¶

Bases: pydantic.BaseModel

Token-level cost accounting for one Backend.generate() call.

Populated by API backends from their provider’s per-call usage payload and threaded onto `GenerationResult.usage` by the orchestrator. Local backends leave this None, since token accounting is meaningless when you control the runtime and the bill is just GPU time.

Cache fields are populated when the provider exposes prompt-caching info (Anthropic surfaces cache_creation_input_tokens / cache_read_input_tokens; the OpenAI-compatible prompt_tokens_details cached-tokens field is normalised into the same shape). Consumers aggregating cost should price input_tokens, cache_creation_input_tokens, and cache_read_input_tokens each at the provider’s rate for that tier and add the results together (cache-read tokens are typically cheaper than fresh input tokens).

cost_credits is filled in by providers that report a per-call cost directly. Today only OpenRouter does so via usage.cost — and the value is denominated in OpenRouter credits, not USD (1 credit ≈ $1 USD by default but the unit is credits, not dollars; see https://openrouter.ai/docs/guides/administration/usage-accounting). Other backends leave the field None and consumers compute cost from token counts themselves.

Attributes:
input_tokens: Prompt + system + document tokens billed as input. Excludes cache-read tokens (those are reported separately).
output_tokens: Tokens the model generated.
cache_creation_input_tokens: Tokens billed at the cache-write rate. None if the provider doesn’t surface caching metadata.
cache_read_input_tokens: Tokens served from cache (typically billed at a discount). None if the provider doesn’t surface caching.
cost_credits: Provider-reported call cost in provider-native units (OpenRouter credits today). None when not exposed.

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config¶: ‘ConfigDict(…)’

input_tokens: int¶: ‘Field(…)’

output_tokens: int¶: ‘Field(…)’

cache_creation_input_tokens: int | None¶: ‘Field(…)’

cache_read_input_tokens: int | None¶: ‘Field(…)’

cost_credits: float | None¶: ‘Field(…)’
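
For backends that leave cost_credits as None, a per-tier cost estimate can be computed from the token counts. The sketch below uses a stand-in dataclass with the documented field names; the `estimate_cost` helper and the prices (per million tokens) are invented placeholders, not any provider's actual rates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Usage:
    """Stand-in carrying the documented TokenUsage field names."""

    input_tokens: int
    output_tokens: int
    cache_creation_input_tokens: Optional[int] = None
    cache_read_input_tokens: Optional[int] = None


# Made-up prices per million tokens, one per billing tier.
PRICES = {"input": 1.0, "output": 2.0, "cache_write": 1.25, "cache_read": 0.1}


def estimate_cost(usage: Usage, prices: dict) -> float:
    """Hypothetical helper: price each tier separately, then sum."""
    cache_write = usage.cache_creation_input_tokens or 0  # None means "not reported"
    cache_read = usage.cache_read_input_tokens or 0
    total = (
        usage.input_tokens * prices["input"]
        + usage.output_tokens * prices["output"]
        + cache_write * prices["cache_write"]
        + cache_read * prices["cache_read"]
    )
    return total / 1_000_000


usage = Usage(input_tokens=1000, output_tokens=500,
              cache_creation_input_tokens=200, cache_read_input_tokens=4000)
print(estimate_cost(usage, PRICES))
```

Treating a None cache field as zero keeps the helper usable across providers that do and do not report caching, at the price of not distinguishing "no cache traffic" from "not reported".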

class citeformer.core.Citation(/, **data: Any)¶

Bases: pydantic.BaseModel

A single inline citation marker emitted by the model.

Attributes:
span: (start, end) character offsets of the marker inside GenerationResult.text.
source_id: 1-indexed position of the cited source inside the sources list that was passed to Citeformer.generate().
verified: Populated by GenerationResult.verify(); False until then. True iff the cited source entails the citing claim with score above threshold.
entailment_score: Populated by GenerationResult.verify(); None until then. Value in [0, 1] indicating NLI entailment confidence.
cited_text: When the backend exposes it (Anthropic Citations API does; others don’t), the exact span of source text the model cited. Lets downstream code show “the model cited this passage” without recomputing, and lets verifiers run NLI against the cited span instead of the whole source. None on backends without span-level attribution.
source_span: (start, end) char offsets inside the source content that cited_text came from. None on backends without span-level attribution. Anthropic returns these as start_char_index / end_char_index for plain-text documents.
document_title: The source’s title as the provider saw it. Mostly a convenience mirror of Source.metadata['title']; populated when the backend echoes a title back (Anthropic’s Citations API attaches document_title to every citation in 2025+ payloads).

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config¶: ‘ConfigDict(…)’

span: tuple[int, int]¶: ‘Field(…)’

source_id: int¶: ‘Field(…)’

verified: bool¶: ‘Field(…)’

entailment_score: float | None¶: ‘Field(…)’

cited_text: str | None¶: ‘Field(…)’

source_span: tuple[int, int] | None¶: ‘Field(…)’

document_title: str | None¶: ‘Field(…)’
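
The two offset pairs index into different strings: span points into the generated text, source_span into the source content. A minimal sketch with plain dicts follows; it assumes the end offset is exclusive (Python slice convention), which the reference above does not state, and the helper names and sample strings are hypothetical.

```python
def marker_at(text: str, span: tuple[int, int]) -> str:
    """Slice the inline marker out of the generated text (end assumed exclusive)."""
    start, end = span
    return text[start:end]


def cited_text_matches(content: str, source_span: tuple[int, int], cited_text: str) -> bool:
    """Check that source_span really selects cited_text inside the source content."""
    start, end = source_span
    return content[start:end] == cited_text


text = "Ravens can mimic speech [1]."
content = "Field notes: ravens can mimic human speech."
quote = "ravens can mimic human speech"

marker_start = text.index("[1]")
quote_start = content.index(quote)
citation = {
    "span": (marker_start, marker_start + len("[1]")),
    "source_id": 1,
    "cited_text": quote,
    "source_span": (quote_start, quote_start + len(quote)),
}

print(marker_at(text, citation["span"]))
print(cited_text_matches(content, citation["source_span"], citation["cited_text"]))
```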

class citeformer.core.Reference(/, **data: Any)¶

Bases: pydantic.BaseModel

A rendered bibliography entry paired with its inline marker.

Every cited source_id has exactly one Reference in GenerationResult.references. Rendering is deterministic via the home-grown render/formatters/ — the model never touches this.

Attributes:
source_id: The 1-indexed source this reference describes. Matches the source_id of every Citation that points at this reference.
inline_marker: How the marker appears in prose. For numeric styles this is "[1]"; for author-year styles "(Poe 1845)"; for footnote styles "¹". The renderer chooses based on the selected CSL style.
rendered: Full bibliography entry, rendered by the style’s formatter. E.g. "Poe, E. A. (1845). The Raven. ...".

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config¶: ‘ConfigDict(…)’

source_id: int¶: ‘Field(…)’

inline_marker: str¶: ‘Field(…)’

rendered: str¶: ‘Field(…)’

class citeformer.core.GenerationResult(/, **data: Any)¶

Bases: pydantic.BaseModel

Full output of a Citeformer.generate() call.

§10.3 contract: schema_version is pinned by tests/integration/test_schemas.py. Any shape change requires bumping schema_version and following the ceremony in docs/reference/contracts.md. Current version: 3, which added the optional usage field so API-backend callers see token counts and (where the provider exposes it) the provider-reported per-call cost without reaching into the raw response. That cost is in provider-native units such as OpenRouter credits, not necessarily USD. See docs/decisions/012-generation-result-schema-v3.md.

Attributes:
schema_version: Contract version. Bump on any field add/rename/removal.
text: The generated prose with inline [N] markers.
citations: One entry per [N] marker, with its char span and source_id.
references: Deterministically rendered bibliography, one entry per unique cited source_id. Rendered by the citeformer.render formatters, never by the LLM.
sources: The sources that were in scope for this generation call. Carried on the result so verify() can run NLI against them without the caller having to pass them separately.
usage: Token counts (and provider-reported cost when exposed) for the backend call that produced this result. None for local backends, where token accounting is meaningless because you control the runtime.

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config¶: ‘ConfigDict(…)’

schema_version: int¶: ‘Field(…)’

text: str¶: ‘Field(…)’

citations: list[citeformer.core.Citation]¶: ‘Field(…)’

references: list[citeformer.core.Reference]¶: ‘Field(…)’

sources: list[citeformer.core.Source]¶: ‘Field(…)’

usage: citeformer.core.TokenUsage | None¶: ‘Field(…)’
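
The relationship between citations and references (one reference per unique cited source_id) can be checked mechanically on a deserialized result. The sketch below works on plain dicts carrying the documented field names; `check_references` is a hypothetical helper, not part of the library.

```python
def check_references(citations: list[dict], references: list[dict]) -> list[str]:
    """Report violations of "exactly one reference per cited source_id"."""
    problems = []
    cited_ids = {c["source_id"] for c in citations}
    ref_ids = [r["source_id"] for r in references]

    for source_id in sorted(set(ref_ids)):
        if ref_ids.count(source_id) > 1:
            problems.append(f"source_id {source_id} has more than one reference")
    for source_id in sorted(cited_ids - set(ref_ids)):
        problems.append(f"source_id {source_id} is cited but has no reference")
    for source_id in sorted(set(ref_ids) - cited_ids):
        problems.append(f"source_id {source_id} has a reference but is never cited")
    return problems


citations = [{"source_id": 1}, {"source_id": 1}, {"source_id": 2}]
references = [{"source_id": 1}, {"source_id": 2}]
print(check_references(citations, references))
print(check_references(citations, references[:1]))
```

A check like this is mostly useful on results loaded from storage or produced by a custom backend, where the invariant is not enforced by construction.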

verify(*, threshold: float = 0.5, nli: Any | None = None, run_coverage: bool = True, **_options: Any) → citeformer.verify.report.VerificationReport¶

Run NLI-based verification against the cited sources.

Requires the verify extra (pip install citeformer[verify]) — the NLI backend is imported lazily on first call.

Args:
threshold: Entailment probability above which a citation counts as supported, and above which an uncited sentence is flagged as needing a citation.
nli: Optional pre-constructed citeformer.verify.NLIModel. If None, the default model (DeBERTa-v3-large-MNLI, or whatever CITEFORMER_NLI_MODEL is set to) is loaded on first use and cached.
run_coverage: If False, skip the NLI coverage check (per-sentence “should this have been cited?” scan). Useful under REQUIRED policy where the grammar guarantees every sentence has a cite.

Returns: A VerificationReport with per-citation entailment scores, an overall support rate, and uncited-but-entailed flags.

Raises:
ImportError: If citeformer[verify] extras aren’t installed.
ValueError: If this result was constructed without sources (e.g. a schema_version=1 serialization that predates the current shape).
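
To make the role of threshold concrete, the sketch below applies it to a list of entailment scores that are assumed to have been computed already. It is not the library's verification code: `summarize_support` is hypothetical, and defining the support rate as the fraction of citations whose score exceeds the threshold is an assumption.

```python
def summarize_support(scores: list[float], threshold: float = 0.5) -> tuple[list[bool], float]:
    """Mark each score as supported if strictly above threshold; return flags and rate."""
    verified = [score > threshold for score in scores]
    # Assumed definition: share of citations that cleared the threshold.
    rate = sum(verified) / len(verified) if verified else 0.0
    return verified, rate


flags, rate = summarize_support([0.9, 0.5, 0.2])
print(flags)
print(round(rate, 3))
```

A score exactly equal to the threshold is not counted as supported here, following the "above threshold" wording; raising the threshold trades recall of supported citations for fewer false positives.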
