citeformer.core

Core types for citeformer.

Contains the §10.2 (Source.metadata CSL-JSON shape) and §10.3 (GenerationResult output schema) contracts — both are pinned by snapshot tests in tests/integration/test_schemas.py. Touching any of these models requires the ceremony documented in docs/reference/contracts.md.

Module Contents

Classes

Policy

Citation enforcement policy.

MarkerStyle

Inline citation-marker visual shape.

Source

A piece of evidence made available to the model.

TokenUsage

Token-level cost accounting for one Backend.generate() call.

Citation

A single inline citation marker emitted by the model.

Reference

A rendered bibliography entry paired with its inline marker.

GenerationResult

Full output of a Citeformer.generate() call.

API

class citeformer.core.Policy

Bases: enum.StrEnum

Citation enforcement policy.

  • REQUIRED: every sentence must end with at least one citation (strictest; default). The grammar bounds per-sentence content at DEFAULT_MAX_CONTENT_CHARS to guarantee progression even on small models — see docs/decisions/009-bounded-content-required.md.

  • QUOTES_ONLY: only quoted spans require a citation; narrative sentences can stand alone.

  • AUTO: citations are optional at every position; verify() surfaces missing citations via the coverage check instead of rejecting them at decode time.
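Because the members are plain strings, a policy can be read straight from a config file or environment variable. The following is a minimal sketch under stated assumptions: the enum is a local stand-in that mirrors the member values documented here (it is not imported from citeformer), and parse_policy is a hypothetical helper, not part of the library.

```python
from __future__ import annotations

from enum import Enum


class Policy(str, Enum):
    """Local stand-in mirroring the documented member values (not the real class)."""

    REQUIRED = "required"
    QUOTES_ONLY = "quotes_only"
    AUTO = "auto"


def parse_policy(raw: str | None) -> Policy:
    """Hypothetical helper: map a config string to a Policy member."""
    if raw is None:
        return Policy.REQUIRED  # REQUIRED is the documented default
    # Lookup by value; raises ValueError for names that are not a policy.
    return Policy(raw.strip().lower())
```

Falling back to REQUIRED when nothing is configured matches the documented default, so a missing setting gets the strictest behaviour rather than the loosest.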

REQUIRED

‘required’

QUOTES_ONLY

‘quotes_only’

AUTO

‘auto’

class citeformer.core.MarkerStyle

Bases: enum.StrEnum

Inline citation-marker visual shape.

Orthogonal to Policy: the structural guarantee (“digit enum is bounded by len(sources)”) holds for every marker style because the grammar enumerates the same set of ids regardless of which delimiters bracket them.

  • BRACKET (default): [1] — numeric styles, IEEE / Vancouver shape.

  • PAREN: (1) — used by some author-year styles and legacy newspaper conventions.

  • CURLY: {1} — less common but useful when the downstream pipeline already reserves square brackets (e.g. Markdown link syntax).

  • CARET: ^1 — caret-prefixed numeric, a footnote-style inline without the superscript Unicode.

Picking a non-bracket marker does not change the §10.1 structural guarantee — it just changes the delimiters used at both the grammar terminal and the post-hoc parse regex.
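To make the "same ids, different delimiters" point concrete, here is a rough sketch of what a post-hoc parse regex could look like. It is an illustration only: DELIMITERS, marker_pattern, and find_markers are hypothetical names, and the library's actual grammar terminal and regex may differ.

```python
from __future__ import annotations

import re

# (opening, closing) delimiters per marker style, as listed above.
DELIMITERS = {
    "bracket": ("[", "]"),
    "paren": ("(", ")"),
    "curly": ("{", "}"),
    "caret": ("^", ""),
}


def marker_pattern(style: str, n_sources: int) -> re.Pattern[str]:
    """Hypothetical helper: regex that matches only ids 1..n_sources for one style."""
    opening, closing = DELIMITERS[style]
    # Enumerate the valid ids explicitly, longest first so "12" is tried before "1".
    ids = sorted((str(i) for i in range(1, n_sources + 1)), key=len, reverse=True)
    alternation = "|".join(ids)
    # With no closing delimiter (caret), refuse a trailing digit so that an
    # out-of-range "^19" is not read as "^1".
    tail = "" if closing else r"(?!\d)"
    return re.compile(f"{re.escape(opening)}({alternation}){re.escape(closing)}{tail}")


def find_markers(text: str, style: str, n_sources: int) -> list[tuple[tuple[int, int], int]]:
    """Return ((start, end), source_id) for every in-range marker in text."""
    pattern = marker_pattern(style, n_sources)
    return [(m.span(), int(m.group(1))) for m in pattern.finditer(text)]
```

Only the delimiters vary between styles; the id alternation is built from len(sources) in every case, which is why an out-of-range marker such as [3] with two sources never matches.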

BRACKET

‘bracket’

PAREN

‘paren’

CURLY

‘curly’

CARET

‘caret’

class citeformer.core.Source(/, **data: Any)

Bases: pydantic.BaseModel

A piece of evidence made available to the model.

Position in the sources list passed to Citeformer.generate() determines the citation index used by the model and echoed back in Citation.source_id and Reference.source_id — it is always 1-indexed.

§10.2 contract: metadata must be a CSL-JSON item — the shape our home-grown formatters (and, historically, citeproc-py) consume. See https://github.com/citation-style-language/schema for the spec.

Attributes:

  • metadata: CSL-JSON item with at least id, type, and whatever fields the selected CSL style needs to render the entry (author, title, issued, container-title, DOI, URL, …).

  • content: Raw chunk text the model may cite from. Passed into the prompt; also used by verify() for NLI entailment.
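For orientation, a CSL-JSON item is an ordinary dictionary. The sketch below shows one plausible shape; the field names come from the CSL-JSON schema, the values are illustrative, and missing_required_keys is a hypothetical pre-flight check rather than something the library provides.

```python
from __future__ import annotations

# One CSL-JSON item. Names are split into family/given parts and dates use
# the nested "date-parts" form, as the CSL-JSON schema specifies.
metadata = {
    "id": "poe1845",
    "type": "article-journal",
    "title": "The Raven",
    "author": [{"family": "Poe", "given": "Edgar Allan"}],
    "issued": {"date-parts": [[1845]]},
}


def missing_required_keys(item: dict) -> list[str]:
    """Hypothetical check for the two keys every item is documented to need."""
    return [key for key in ("id", "type") if key not in item]


# With citeformer installed, the item would be passed as:
#     Source(metadata=metadata, content="Once upon a midnight dreary ...")
```

Which further fields matter depends on the selected CSL style, so a check like this only catches the two keys that are always needed.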

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config

‘ConfigDict(…)’

metadata: dict[str, Any]

‘Field(…)’

content: str

‘Field(…)’

classmethod from_doi(doi: str, **kwargs: Any) -> Self

Build a Source from a Crossref DOI lookup.

The returned content field is empty — DOI metadata alone doesn’t ship the paper text. If you have the PDF, use Source.from_pdf to get the text and merge with metadata=source.metadata | pdf_meta, or construct the combined Source directly.

Args:

  • doi: DOI in bare, URL, or doi: form.

  • **kwargs: Forwarded to citeformer.metadata.fetch_crossref (timeout, use_cache).

Returns: A Source with metadata = CSL-JSON from Crossref and empty content.
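The metadata=source.metadata | pdf_meta merge mentioned above relies on Python's dict union operator, where the right-hand operand wins on key collisions. A small sketch with stand-in dictionaries (the values are made up for illustration) shows why operand order matters:

```python
# Stand-ins for the metadata of a from_doi result and a from_pdf result.
doi_meta = {"id": "10.1000/example", "type": "article-journal", "title": "Clean Crossref Title"}
pdf_meta = {"title": "untitled.pdf", "abstract": "Text recovered from the PDF."}

# Right operand wins: here the PDF's values override Crossref's.
pdf_wins = doi_meta | pdf_meta

# Swapping the operands keeps Crossref's values and only adds the extra PDF keys.
crossref_wins = pdf_meta | doi_meta
```

Since pypdf metadata is described as best-effort, putting the Crossref dictionary on the right is often the safer order when both sides carry a title.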

classmethod from_arxiv(arxiv_id: str, **kwargs: Any) -> Self

Build a Source from an arXiv API lookup.

The abstract becomes content so the model has something concrete to cite. For the full paper body, fetch the PDF and use Source.from_pdf separately.

Args:

  • arxiv_id: arXiv id (bare, URL, or arxiv: form; version suffix is stripped).

  • **kwargs: Forwarded to citeformer.metadata.fetch_arxiv.

Returns: A Source with the arXiv CSL-JSON and the abstract in content.
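The accepted id forms can be pictured with a small normaliser. This is a sketch of the idea only: normalize_arxiv_id is a hypothetical function, and the library's own parsing may handle more cases or behave differently at the edges.

```python
import re


def normalize_arxiv_id(raw: str) -> str:
    """Hypothetical normaliser: accept bare, URL, or arxiv: forms and drop the version."""
    value = raw.strip()
    # Drop a leading "arxiv:" prefix, in any letter case.
    value = re.sub(r"^arxiv:", "", value, flags=re.IGNORECASE)
    # Drop an abs/pdf URL prefix such as https://arxiv.org/abs/.
    value = re.sub(r"^https?://(?:www\.)?arxiv\.org/(?:abs|pdf)/", "", value, flags=re.IGNORECASE)
    # Drop a trailing ".pdf", then a trailing version suffix such as "v5".
    value = re.sub(r"\.pdf$", "", value)
    return re.sub(r"v\d+$", "", value)
```

Stripping the version means every revision of a paper resolves to the same id, which keeps caching and deduplication simple.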

classmethod from_pdf(path: str | Any, **kwargs: Any) -> Self

Build a Source from a local PDF.

Args:

  • path: Filesystem path to the PDF.

  • **kwargs: Forwarded to citeformer.metadata.extract_pdf. The important ones:

      - extractor ("pypdf" | "grobid", default "pypdf"). "grobid" requires pip install citeformer[grobid] plus a running GROBID server (typical dev setup: docker run -p 8070:8070 grobid/grobid:0.8.0).

      - grobid_url (default http://localhost:8070), used with the GROBID extractor.

Returns: A Source with best-effort CSL metadata. pypdf pulls title/author/issued from the PDF info dict when set; GROBID additionally returns clean author lists (family/given), an abstract field, and section-level body text.

classmethod from_url(url: str, **kwargs: Any) -> Self

Build a Source from an HTTP(S) URL.

Uses readability-lxml for the article body and meta-tag parsing (OpenGraph / Twitter / article) for title / author / date / site.

Args:

  • url: HTTP(S) URL.

  • **kwargs: Forwarded to citeformer.metadata.extract_url.

Returns: A Source with webpage CSL metadata and the article body in content.

classmethod from_bibtex(source: str | Any, **kwargs: Any) -> list[Self]

Build Source instances from a BibTeX file or string.

Each BibTeX entry becomes one Source. content is left empty, because BibTeX is bibliographic metadata only. Users who need chunk text should extend the returned items after load (e.g. pair with PDF fetches for the same DOI). Source.from_doi is available for per-entry DOI lookups, but it also returns empty content.

Args:

  • source: Filesystem path to a .bib file or a BibTeX string.

  • **kwargs: Reserved for future options (none currently).

Returns: A list of Source objects in document order.

classmethod from_zotero(source: str | Any, **kwargs: Any) -> list[Self]

Build Source instances from a Zotero “Export → CSL JSON” file.

The CSL JSON export is the shape we consume natively; this classmethod is sugar for [Source(metadata=item, content="") for item in load_zotero_csl(source)]. Also supports the Better BibTeX CSL-JSON export format (identical schema).

Args:

  • source: Filesystem path to a .json export, raw JSON string, or an iterable of items.

  • **kwargs: Forwarded to citeformer.metadata.load_zotero_csl (filter_fn, dedupe).

Returns: A list of Source objects in the export’s order.

class citeformer.core.TokenUsage(/, **data: Any)

Bases: pydantic.BaseModel

Token-level cost accounting for one Backend.generate() call.

Populated by API backends from their provider’s per-call usage payload and threaded onto GenerationResult.usage by the orchestrator. Local backends leave this None: token accounting is meaningless when you control the runtime and the bill is just GPU time.

Cache fields are populated when the provider exposes prompt-caching info (Anthropic surfaces cache_creation_input_tokens / cache_read_input_tokens; the OpenAI-compatible prompt_tokens_details cached-tokens field is normalised into the same shape). Consumers aggregating cost should price input_tokens, cache_creation_input_tokens, and cache_read_input_tokens each at the provider’s rate for that tier and sum the results (cache-read tokens are typically cheaper than fresh input tokens).

cost_credits is filled in by providers that report a per-call cost directly. Today only OpenRouter does so via usage.cost — and the value is denominated in OpenRouter credits, not USD (1 credit ≈ $1 USD by default but the unit is credits, not dollars; see https://openrouter.ai/docs/guides/administration/usage-accounting). Other backends leave the field None and consumers compute cost from token counts themselves.

Attributes:

  • input_tokens: Prompt + system + document tokens billed as input. Excludes cache-read tokens (those are reported separately).

  • output_tokens: Tokens the model generated.

  • cache_creation_input_tokens: Tokens billed at the cache-write rate. None if the provider doesn’t surface caching metadata.

  • cache_read_input_tokens: Tokens served from cache (typically billed at a discount). None if the provider doesn’t surface caching.

  • cost_credits: Provider-reported call cost in provider-native units (OpenRouter credits today). None when not exposed.
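When cost_credits is None, cost has to be derived from the token counts. The sketch below prices each tier separately; Usage is a local stand-in with the same field names as TokenUsage, and Prices and estimate_cost are hypothetical. The per-million-token prices are placeholders, not any provider's actual rates.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Usage:
    """Stand-in with the same field names as TokenUsage (not the real model)."""

    input_tokens: int
    output_tokens: int
    cache_creation_input_tokens: int | None = None
    cache_read_input_tokens: int | None = None


@dataclass
class Prices:
    """Hypothetical prices per million tokens; real values come from the provider."""

    input: float
    output: float
    cache_write: float
    cache_read: float


def estimate_cost(usage: Usage, prices: Prices) -> float:
    """Price each token tier at its own rate; None cache fields count as zero."""
    per_token = 1 / 1_000_000
    return per_token * (
        usage.input_tokens * prices.input
        + usage.output_tokens * prices.output
        + (usage.cache_creation_input_tokens or 0) * prices.cache_write
        + (usage.cache_read_input_tokens or 0) * prices.cache_read
    )
```

Because input_tokens excludes cache-read tokens, the tiers do not overlap and can be added without double counting.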

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config

‘ConfigDict(…)’

input_tokens: int

‘Field(…)’

output_tokens: int

‘Field(…)’

cache_creation_input_tokens: int | None

‘Field(…)’

cache_read_input_tokens: int | None

‘Field(…)’

cost_credits: float | None

‘Field(…)’

class citeformer.core.Citation(/, **data: Any)

Bases: pydantic.BaseModel

A single inline citation marker emitted by the model.

Attributes:

  • span: (start, end) character offsets of the marker inside GenerationResult.text.

  • source_id: 1-indexed position of the cited source inside the sources list that was passed to Citeformer.generate().

  • verified: Populated by GenerationResult.verify(); False until then. True iff the cited source entails the citing claim with score above threshold.

  • entailment_score: Populated by GenerationResult.verify(); None until then. Value in [0, 1] indicating NLI entailment confidence.

  • cited_text: When the backend exposes it (the Anthropic Citations API does; others don’t), the exact span of source text the model cited. Lets downstream code show “the model cited this passage” without recomputing, and lets verifiers run NLI against the cited span instead of the whole source. None on backends without span-level attribution.

  • source_span: (start, end) char offsets inside the source content that cited_text came from. None on backends without span-level attribution. Anthropic returns these as start_char_index / end_char_index for plain-text documents.

  • document_title: The source’s title as the provider saw it. Mostly a convenience mirror of Source.metadata['title'], populated when the backend echoes a title back (Anthropic’s Citations API attaches document_title to every citation in 2025+ payloads).
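The two indexing conventions are easy to mix up: span indexes into the generated text, while source_id is a 1-indexed position in the sources list. A minimal sketch with plain stand-in data follows; it assumes span is end-exclusive like a Python slice, which this page does not state explicitly.

```python
text = "Ravens can solve multi-step puzzles [2]."
sources = ["Crows use tools.", "Ravens solved a multi-step puzzle in trials."]

# Stand-in for one Citation, using the documented field names.
citation = {"span": (36, 39), "source_id": 2}

# Assumption: span is end-exclusive, so slicing recovers the marker itself.
start, end = citation["span"]
marker = text[start:end]

# source_id is 1-indexed; subtract one to index the Python list.
cited_source = sources[citation["source_id"] - 1]
```

Forgetting the minus one silently attributes a claim to the next source in the list, or raises IndexError for the last one.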

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config

‘ConfigDict(…)’

span: tuple[int, int]

‘Field(…)’

source_id: int

‘Field(…)’

verified: bool

‘Field(…)’

entailment_score: float | None

‘Field(…)’

cited_text: str | None

‘Field(…)’

source_span: tuple[int, int] | None

‘Field(…)’

document_title: str | None

‘Field(…)’

class citeformer.core.Reference(/, **data: Any)

Bases: pydantic.BaseModel

A rendered bibliography entry paired with its inline marker.

Every cited source_id has exactly one Reference in GenerationResult.references. Rendering is deterministic via the home-grown render/formatters/; the model never touches this.

Attributes:

  • source_id: The 1-indexed source this reference describes. Matches the source_id of every Citation that points at this reference.

  • inline_marker: How the marker appears in prose. For numeric styles this is "[1]"; for author-year styles "(Poe 1845)"; for footnote styles "¹". The renderer chooses based on the selected CSL style.

  • rendered: Full bibliography entry, rendered by the style’s formatter. E.g. "Poe, E. A. (1845). The Raven. ...".

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config

‘ConfigDict(…)’

source_id: int

‘Field(…)’

inline_marker: str

‘Field(…)’

rendered: str

‘Field(…)’

class citeformer.core.GenerationResult(/, **data: Any)

Bases: pydantic.BaseModel

Full output of a Citeformer.generate() call.

§10.3 contract: schema_version is pinned by tests/integration/test_schemas.py. Any shape change requires bumping schema_version and following the ceremony in docs/reference/contracts.md. Current version: 3, which added the optional usage field so API-backend callers see token counts and (where the provider exposes it) the provider-reported per-call cost without reaching into the raw response. See docs/decisions/012-generation-result-schema-v3.md.

Attributes:

  • schema_version: Contract version. Bump on any field add/rename/removal.

  • text: The generated prose with inline [N] markers.

  • citations: One entry per [N] marker, with its char span and source_id.

  • references: Deterministically rendered bibliography, one entry per unique cited source_id. Rendered by the citeformer.render formatters, never by the LLM.

  • sources: The sources that were in scope for this generation call. Carried on the result so verify() can run NLI against them without the caller having to pass them separately.

  • usage: Token counts (and provider-reported cost when exposed) for the backend call that produced this result. None for local backends, where token accounting is meaningless because you control the runtime.
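The relationships between citations, references, and sources can be spelled out as a consumer-side sanity check. This is a sketch using plain dictionaries in place of the real models; check_result is a hypothetical helper, and the library may already enforce these invariants itself.

```python
from __future__ import annotations


def check_result(citations: list[dict], references: list[dict], n_sources: int) -> list[str]:
    """Hypothetical check: list every way a result breaks the documented invariants."""
    problems: list[str] = []
    ref_ids = [ref["source_id"] for ref in references]
    # One reference per unique cited source_id means no repeats in the list.
    if len(ref_ids) != len(set(ref_ids)):
        problems.append("references contain a duplicate source_id")
    for citation in citations:
        sid = citation["source_id"]
        if not 1 <= sid <= n_sources:
            # source_id is a 1-indexed position in the sources list.
            problems.append(f"source_id {sid} is outside 1..{n_sources}")
        elif sid not in ref_ids:
            problems.append(f"source_id {sid} has no reference")
    return problems
```

An empty list means every citation points at a source that was in scope and has a bibliography entry.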

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config

‘ConfigDict(…)’

schema_version: int

‘Field(…)’

text: str

‘Field(…)’

citations: list[citeformer.core.Citation]

‘Field(…)’

references: list[citeformer.core.Reference]

‘Field(…)’

sources: list[citeformer.core.Source]

‘Field(…)’

usage: citeformer.core.TokenUsage | None

‘Field(…)’

verify(*, threshold: float = 0.5, nli: Any | None = None, run_coverage: bool = True, **_options: Any) -> citeformer.verify.report.VerificationReport

Run NLI-based verification against the cited sources.

Requires the verify extra (pip install citeformer[verify]) — the NLI backend is imported lazily on first call.

Args:

  • threshold: Entailment probability above which a citation is supported and an uncited sentence is flagged as needing a citation.

  • nli: Optional pre-constructed citeformer.verify.NLIModel. If None, the default model (DeBERTa-v3-large-MNLI, or whatever CITEFORMER_NLI_MODEL is set to) is loaded on first use and cached.

  • run_coverage: If False, skip the NLI coverage check (per-sentence “should this have been cited?” scan). Useful under REQUIRED policy where the grammar guarantees every sentence has a cite.

Returns: A VerificationReport with per-citation entailment scores, an overall support rate, and uncited-but-entailed flags.

Raises:

  • ImportError: If the citeformer[verify] extra isn’t installed.

  • ValueError: If this result was constructed without sources (e.g. a schema_version=1 serialization that predates the current shape).
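As a rough illustration of how threshold separates supported from unsupported citations, the sketch below computes a support rate from a list of entailment scores. How VerificationReport actually derives its overall support rate is not documented on this page, so support_rate is a hypothetical helper and the strict "above threshold" comparison is an assumption based on the wording of Citation.verified.

```python
from __future__ import annotations


def support_rate(scores: list[float], threshold: float = 0.5) -> float:
    """Fraction of citations whose entailment score is strictly above threshold."""
    if not scores:
        return 0.0  # assumption: no citations counts as zero support
    supported = sum(score > threshold for score in scores)
    return supported / len(scores)


# With citeformer[verify] installed, the documented call would be along the lines of:
#     report = result.verify(threshold=0.7, run_coverage=False)
```

Raising threshold trades recall for precision: fewer citations count as supported, but those that do are backed by stronger entailment.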