citeformer.core¶
Core types for citeformer.
Contains the §10.2 (Source.metadata CSL-JSON shape) and §10.3 (GenerationResult
output schema) contracts — both are pinned by snapshot tests in
tests/integration/test_schemas.py. Touching any of these models requires
the ceremony documented in docs/reference/contracts.md.
Module Contents¶
Classes¶
Policy | Citation enforcement policy.
MarkerStyle | Inline citation-marker visual shape.
Source | A piece of evidence made available to the model.
TokenUsage | Token-level cost accounting for one Backend.generate() call.
Citation | A single inline citation marker emitted by the model.
Reference | A rendered bibliography entry paired with its inline marker.
GenerationResult | Full output of a Citeformer.generate() call.
API¶
- class citeformer.core.Policy¶
Bases: enum.StrEnum
Citation enforcement policy.
- REQUIRED: every sentence must end with at least one citation (strictest; default). The grammar bounds per-sentence content at DEFAULT_MAX_CONTENT_CHARS to guarantee progression even on small models; see docs/decisions/009-bounded-content-required.md.
- QUOTES_ONLY: only quoted spans require a citation; narrative sentences can stand alone.
- AUTO: citations are optional at every position; verify() surfaces missing citations via the coverage check instead of rejecting them at decode time.
- REQUIRED¶
‘required’
- QUOTES_ONLY¶
‘quotes_only’
- AUTO¶
‘auto’
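As a rough illustration of what the three policies ask of a finished string, the sketch below re-declares the enum locally (a stand-in, not an import from citeformer) and checks text against it. The helper name satisfies_policy, the naive sentence splitter, and the bracket-only marker regex are all assumptions made for this example; per the description above, the library enforces REQUIRED at decode time through its grammar, not with a post-hoc check like this one.

```python
import re
from enum import Enum


class Policy(str, Enum):
    """Local stand-in mirroring the documented member values."""
    REQUIRED = "required"
    QUOTES_ONLY = "quotes_only"
    AUTO = "auto"


# Assumption: bracket-style markers such as [1] or [2][3] close a sentence.
_TRAILING_MARKERS = re.compile(r"(\[\d+\])+\s*$")


def satisfies_policy(text: str, policy: Policy) -> bool:
    """Hypothetical post-hoc check; splits sentences naively on '.', '!' and '?'."""
    if policy is Policy.AUTO:
        return True  # citations are optional at every position
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    if policy is Policy.REQUIRED:
        return all(_TRAILING_MARKERS.search(s) for s in sentences)
    # QUOTES_ONLY: only sentences that contain a quoted span need a marker.
    return all(_TRAILING_MARKERS.search(s) for s in sentences if '"' in s)
```

Because the stand-in mixes in str, a configuration value such as "quotes_only" can be turned into a member with Policy("quotes_only").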
- class citeformer.core.MarkerStyle¶
Bases: enum.StrEnum
Inline citation-marker visual shape.
Orthogonal to Policy: the structural guarantee (“digit enum is bounded by len(sources)”) holds for every marker style because the grammar enumerates the same set of ids regardless of which delimiters bracket them.
- BRACKET (default): [1]. Numeric styles, IEEE / Vancouver shape.
- PAREN: (1). Used by some author-year styles and legacy newspaper conventions.
- CURLY: {1}. Less common but useful when the downstream pipeline already reserves square brackets (e.g. Markdown link syntax).
- CARET: ^1. Caret-prefixed numeric, a footnote-style inline without the superscript Unicode.
Picking a non-bracket marker does not change the §10.1 structural guarantee — it just changes the delimiters used at both the grammar terminal and the post-hoc parse regex.
- BRACKET¶
‘bracket’
- PAREN¶
‘paren’
- CURLY¶
‘curly’
- CARET¶
‘caret’
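To make the "bounded by len(sources)" guarantee concrete, here is a minimal sketch of a post-hoc parse regex that enumerates the valid ids explicitly for each marker style. The enum is re-declared locally, and the _DELIMITERS table and marker_pattern helper are hypothetical names chosen for this example; the library's actual grammar terminal and regex may be built differently.

```python
import re
from enum import Enum


class MarkerStyle(str, Enum):
    """Local stand-in mirroring the documented member values."""
    BRACKET = "bracket"
    PAREN = "paren"
    CURLY = "curly"
    CARET = "caret"


# Assumed (open, close) delimiters per style; CARET has no closing delimiter.
_DELIMITERS = {
    MarkerStyle.BRACKET: ("[", "]"),
    MarkerStyle.PAREN: ("(", ")"),
    MarkerStyle.CURLY: ("{", "}"),
    MarkerStyle.CARET: ("^", ""),
}


def marker_pattern(style: MarkerStyle, n_sources: int) -> re.Pattern:
    """Hypothetical helper: match only ids 1..n_sources in the given style."""
    if n_sources < 1:
        raise ValueError("need at least one source")
    open_, close = _DELIMITERS[style]
    # Enumerate ids explicitly, largest first, so "12" is tried before "1".
    ids = "|".join(str(i) for i in range(n_sources, 0, -1))
    # (?!\d) stops a caret marker such as ^1 from matching inside ^13.
    return re.compile(re.escape(open_) + "(" + ids + ")" + re.escape(close) + r"(?!\d)")
```

Swapping the style only changes the escaped delimiters; the enumerated id set stays the same, which is the point the description above makes about the guarantee being independent of marker shape.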
- class citeformer.core.Source(/, **data: Any)¶
Bases: pydantic.BaseModel
A piece of evidence made available to the model.
Position in the sources list passed to Citeformer.generate() determines the citation index used by the model and echoed back in Citation.source_id and Reference.source_id; it is always 1-indexed.
§10.2 contract: metadata must be a CSL-JSON item, the shape our home-grown formatters (and, historically, citeproc-py) consume. See https://github.com/citation-style-language/schema for the spec.
Attributes:
    metadata: CSL-JSON item with at least id, type, and whatever fields the selected CSL style needs to render the entry (author, title, issued, container-title, DOI, URL, …).
    content: Raw chunk text the model may cite from. Passed into the prompt; also used by verify() for NLI entailment.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- model_config¶
‘ConfigDict(…)’
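The sketch below shows what a minimal CSL-JSON item can look like and how list position turns into a 1-indexed citation id. Plain dicts stand in for Source so the example runs without the library, and missing_required_keys is a hypothetical helper that only checks the two keys the contract names (id and type); real validation is done by the pydantic model.

```python
# Plain dicts stand in for citeformer.core.Source here (an assumption made so
# the sketch runs without the library installed).
poe = {
    "metadata": {
        "id": "poe1845",
        "type": "article-newspaper",
        "title": "The Raven",
        "author": [{"family": "Poe", "given": "Edgar Allan"}],
        "issued": {"date-parts": [[1845]]},
    },
    "content": "Once upon a midnight dreary, while I pondered, weak and weary, ...",
}
placeholder = {
    "metadata": {"id": "example-page", "type": "webpage", "title": "Example Page"},
    "content": "Placeholder text.",
}
sources = [poe, placeholder]

# Citation indices follow list position and start at 1, not 0.
index_to_source = {i: s for i, s in enumerate(sources, start=1)}


def missing_required_keys(metadata: dict) -> list:
    """Hypothetical check for the two keys the contract names explicitly."""
    return [key for key in ("id", "type") if key not in metadata]
```

Reordering the sources list changes which id each source receives, so any stored Citation.source_id values are only meaningful alongside the list they were generated against.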
- classmethod from_doi(doi: str, **kwargs: Any) Self¶
Build a Source from a Crossref DOI lookup.
The returned content field is empty: DOI metadata alone doesn’t ship the paper text. If you have the PDF, use Source.from_pdf to get the text and merge with metadata=source.metadata | pdf_meta, or construct the combined Source directly.
Args:
    doi: DOI in bare, URL, or doi: form.
    **kwargs: Forwarded to citeformer.metadata.fetch_crossref (timeout, use_cache).
Returns:
    A Source with metadata = CSL-JSON from Crossref and empty content.
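Two details above are easy to show in isolation: accepting a DOI in bare, URL, or doi: form, and what the metadata merge does on key conflicts. The normalize_doi helper is hypothetical (the library's own normalization is not shown in this page), and the DOI and metadata values are placeholders.

```python
def normalize_doi(doi: str) -> str:
    """Hypothetical helper: reduce a bare, URL, or doi:-prefixed DOI to bare form."""
    doi = doi.strip()
    lowered = doi.lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if lowered.startswith(prefix):
            return doi[len(prefix):].strip()
    return doi


# Placeholder metadata, not real Crossref output.
crossref_meta = {"id": "10.1000/example", "type": "article-journal", "title": "From Crossref"}
pdf_meta = {"title": "From PDF info dict", "abstract": "Placeholder abstract."}

# Dict union (Python 3.9+): keys on the right-hand side win on conflict.
merged = crossref_meta | pdf_meta
```

Since the right-hand operand wins, source.metadata | pdf_meta lets PDF-derived values override Crossref values for shared keys; swapping the operands keeps the Crossref values instead.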
- classmethod from_arxiv(arxiv_id: str, **kwargs: Any) Self¶
Build a Source from an arXiv API lookup.
The abstract becomes content so the model has something concrete to cite. For the full paper body, fetch the PDF and use Source.from_pdf separately.
Args:
    arxiv_id: arXiv id (bare, URL, or arxiv: form; version suffix is stripped).
    **kwargs: Forwarded to citeformer.metadata.fetch_arxiv.
Returns:
    A Source with the arXiv CSL-JSON and the abstract in content.
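The accepted id forms can be illustrated with a small normalizer. This is a sketch under assumptions: normalize_arxiv_id is a hypothetical name, the list of URL prefixes is a guess at what "URL form" covers, and the ids used are placeholders.

```python
import re

# A trailing version marker such as v1 or v12.
_VERSION_SUFFIX = re.compile(r"v\d+$")


def normalize_arxiv_id(arxiv_id: str) -> str:
    """Hypothetical helper: reduce bare, URL, or arxiv:-prefixed ids to a versionless id."""
    value = arxiv_id.strip()
    lowered = value.lower()
    prefixes = (
        "https://arxiv.org/abs/",
        "http://arxiv.org/abs/",
        "https://arxiv.org/pdf/",
        "http://arxiv.org/pdf/",
        "arxiv:",
    )
    for prefix in prefixes:
        if lowered.startswith(prefix):
            value = value[len(prefix):]
            break
    if value.lower().endswith(".pdf"):
        value = value[:-4]  # drop the extension from PDF links
    return _VERSION_SUFFIX.sub("", value)
```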
- classmethod from_pdf(path: str | Any, **kwargs: Any) Self¶
Build a Source from a local PDF.
Args:
    path: Filesystem path to the PDF.
    **kwargs: Forwarded to citeformer.metadata.extract_pdf. The important ones:
        - extractor ("pypdf" | "grobid", default "pypdf"). "grobid" requires pip install citeformer[grobid] plus a running GROBID server (typical dev setup: docker run -p 8070:8070 grobid/grobid:0.8.0).
        - grobid_url (default http://localhost:8070) when using the GROBID extractor.
Returns:
    A Source with best-effort CSL metadata. pypdf pulls title / author / issued from the PDF info dict when set; GROBID additionally returns clean author lists (family / given), an abstract field, and section-level body text.
- classmethod from_url(url: str, **kwargs: Any) Self¶
Build a Source from an HTTP(S) URL.
Uses readability-lxml for the article body and meta-tag parsing (OpenGraph / Twitter / article) for title / author / date / site.
Args:
    url: HTTP(S) URL.
    **kwargs: Forwarded to citeformer.metadata.extract_url.
Returns:
    A Source with webpage CSL metadata and the article body in content.
- classmethod from_bibtex(source: str | Any, **kwargs: Any) list[Self]¶
Build Source instances from a BibTeX file or string.
Each BibTeX entry becomes one Source. content is left empty, since BibTeX is bibliographic metadata only. Users who need chunk text should either extend the returned items after load (e.g. pair with PDF fetches for the same DOI) or use Source.from_doi for per-entry DOI lookups.
Args:
    source: Filesystem path to a .bib file or a BibTeX string.
    **kwargs: Reserved for future options (none currently).
Returns:
    A list of Source objects in document order.
- classmethod from_zotero(source: str | Any, **kwargs: Any) list[Self]¶
Build Source instances from a Zotero “Export → CSL JSON” file.
The CSL JSON export is the shape we consume natively; this classmethod is sugar for [Source(metadata=item, content="") for item in load_zotero_csl(path)]. Also supports the Better BibTeX CSL-JSON export format (identical schema).
Args:
    source: Filesystem path to a .json export, raw JSON string, or an iterable of items.
    **kwargs: Forwarded to citeformer.metadata.load_zotero_csl (filter_fn, dedupe).
Returns:
    A list of Source objects in the export’s order.
- class citeformer.core.TokenUsage(/, **data: Any)¶
Bases: pydantic.BaseModel
Token-level cost accounting for one Backend.generate() call.
Populated by API backends from their provider’s per-call usage payload and threaded onto GenerationResult.usage by the orchestrator. Local backends leave this None: token accounting is meaningless when you control the runtime and the bill is just GPU time.
Cache fields are populated when the provider exposes prompt-caching info (Anthropic surfaces cache_creation_input_tokens / cache_read_input_tokens; the OpenAI-compatible prompt_tokens_details cached-tokens field is normalised into the same shape). Consumers aggregating cost should price input_tokens, cache_creation_input_tokens, and cache_read_input_tokens separately at the provider’s per-tier rates and sum the results (cache-read tokens are typically cheaper than fresh input tokens).
cost_credits is filled in by providers that report a per-call cost directly. Today only OpenRouter does so, via usage.cost, and the value is denominated in OpenRouter credits, not USD (1 credit ≈ $1 USD by default but the unit is credits, not dollars; see https://openrouter.ai/docs/guides/administration/usage-accounting). Other backends leave the field None and consumers compute cost from token counts themselves.
Attributes:
    input_tokens: Prompt + system + document tokens billed as input. Excludes cache-read tokens (those are reported separately).
    output_tokens: Tokens the model generated.
    cache_creation_input_tokens: Tokens billed at the cache-write rate. None if the provider doesn’t surface caching metadata.
    cache_read_input_tokens: Tokens served from cache (typically billed at a discount). None if the provider doesn’t surface caching.
    cost_credits: Provider-reported call cost in provider-native units (OpenRouter credits today). None when not exposed.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- model_config¶
‘ConfigDict(…)’
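For backends that leave cost_credits as None, a consumer-side estimate along the lines described above might look like the following sketch. The TokenUsage dataclass is a local stand-in with the documented field names, while PriceSheet, estimate_cost, and every price are hypothetical; real rates come from the provider's pricing page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenUsage:
    """Local stand-in with the documented fields (not the pydantic model itself)."""
    input_tokens: int
    output_tokens: int
    cache_creation_input_tokens: Optional[int] = None
    cache_read_input_tokens: Optional[int] = None
    cost_credits: Optional[float] = None


@dataclass
class PriceSheet:
    """Hypothetical prices per million tokens, one per billing tier."""
    input: float
    output: float
    cache_write: float
    cache_read: float


def estimate_cost(usage: TokenUsage, prices: PriceSheet) -> float:
    """Price each tier separately; cache fields that are None count as zero."""
    tokens_per_price_unit = 1_000_000
    total = (
        usage.input_tokens * prices.input
        + usage.output_tokens * prices.output
        + (usage.cache_creation_input_tokens or 0) * prices.cache_write
        + (usage.cache_read_input_tokens or 0) * prices.cache_read
    )
    return total / tokens_per_price_unit
```

The result is in whatever currency the price sheet uses, so it should not be added to cost_credits values without converting units first.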
- class citeformer.core.Citation(/, **data: Any)¶
Bases: pydantic.BaseModel
A single inline citation marker emitted by the model.
Attributes:
    span: (start, end) character offsets of the marker inside GenerationResult.text.
    source_id: 1-indexed position of the cited source inside the sources list that was passed to Citeformer.generate().
    verified: Populated by GenerationResult.verify(); False until then. True iff the cited source entails the citing claim with score above threshold.
    entailment_score: Populated by GenerationResult.verify(); None until then. Value in [0, 1] indicating NLI entailment confidence.
    cited_text: When the backend exposes it (Anthropic Citations API does; others don’t), the exact span of source text the model cited. Lets downstream code show “the model cited this passage” without recomputing, and lets verifiers run NLI against the cited span instead of the whole source. None on backends without span-level attribution.
    source_span: (start, end) char offsets inside the source content that cited_text came from. None on backends without span-level attribution. Anthropic returns these as start_char_index / end_char_index for plain-text documents.
    document_title: The source’s title as the provider saw it. Mostly a convenience mirror of Source.metadata['title'], populated when the backend echoes a title back (Anthropic’s Citations API attaches document_title to every citation in 2025+ payloads).
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- model_config¶
‘ConfigDict(…)’
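The span and source_id fields can be pictured with a small post-hoc parse over generated text. This is only a sketch: the Citation dataclass is a local stand-in carrying a subset of the documented fields, find_citations is a hypothetical helper, and it assumes bracket-style markers.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Citation:
    """Local stand-in with a subset of the documented fields."""
    span: Tuple[int, int]
    source_id: int
    verified: bool = False
    entailment_score: Optional[float] = None


def find_citations(text: str, n_sources: int) -> List[Citation]:
    """Hypothetical parse of [N] markers; ids outside 1..n_sources are skipped."""
    found = []
    for match in re.finditer(r"\[(\d+)\]", text):
        source_id = int(match.group(1))
        if 1 <= source_id <= n_sources:
            # match.span() gives (start, end) offsets of the marker in text.
            found.append(Citation(span=match.span(), source_id=source_id))
    return found
```

Slicing the text with a citation's span, as in text[start:end], returns the marker itself, which is a quick way to sanity-check offsets after any post-processing of the text.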
- class citeformer.core.Reference(/, **data: Any)¶
Bases: pydantic.BaseModel
A rendered bibliography entry paired with its inline marker.
Every cited source_id has exactly one Reference in GenerationResult.references. Rendering is deterministic via the home-grown render/formatters/; the model never touches this.
Attributes:
    source_id: The 1-indexed source this reference describes. Matches the source_id of every Citation that points at this reference.
    inline_marker: How the marker appears in prose. For numeric styles this is "[1]"; for author-year styles "(Poe 1845)"; for footnote styles "¹". The renderer chooses based on the selected CSL style.
    rendered: Full bibliography entry, rendered by the style’s formatter. E.g. "Poe, E. A. (1845). The Raven. ...".
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- model_config¶
‘ConfigDict(…)’
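The "exactly one Reference per cited source_id" rule amounts to de-duplicating the cited ids before rendering. The sketch below uses plain dicts and hypothetical helper names; ordering by first appearance is an assumption, and only the numeric "[N]" marker shape is shown.

```python
from typing import Dict, Iterable, List


def unique_source_ids(cited_ids: Iterable[int]) -> List[int]:
    """Distinct source ids in order of first appearance (the ordering is an assumption)."""
    return list(dict.fromkeys(cited_ids))


def numeric_references(cited_ids: Iterable[int], rendered_by_id: Dict[int, str]) -> List[dict]:
    """Hypothetical helper: pair each unique id with a numeric marker and its rendered entry."""
    return [
        {"source_id": sid, "inline_marker": f"[{sid}]", "rendered": rendered_by_id[sid]}
        for sid in unique_source_ids(cited_ids)
    ]
```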
- class citeformer.core.GenerationResult(/, **data: Any)¶
Bases: pydantic.BaseModel
Full output of a Citeformer.generate() call.
§10.3 contract: schema_version is pinned by tests/integration/test_schemas.py. Any shape change requires bumping schema_version and following the ceremony in docs/reference/contracts.md. Current version: 3, which added the optional usage field so API-backend callers see token counts and (where the provider exposes it) per-call cost in provider-native units without reaching into the raw response. See docs/decisions/012-generation-result-schema-v3.md.
Attributes:
    schema_version: Contract version. Bump on any field add/rename/removal.
    text: The generated prose with inline [N] markers.
    citations: One entry per [N] marker, with its char span and source_id.
    references: Deterministically rendered bibliography, one entry per unique cited source_id. Rendered by the citeformer.render formatters, never by the LLM.
    sources: The sources that were in scope for this generation call. Carried on the result so verify() can run NLI against them without the caller having to pass them separately.
    usage: Token counts (and provider-reported cost when exposed) for the backend call that produced this result. None for local backends, where token accounting is meaningless because you control the runtime.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- model_config¶
‘ConfigDict(…)’
- citations: list[citeformer.core.Citation]¶
‘Field(…)’
- references: list[citeformer.core.Reference]¶
‘Field(…)’
- sources: list[citeformer.core.Source]¶
‘Field(…)’
- usage: citeformer.core.TokenUsage | None¶
‘Field(…)’
- verify(*, threshold: float = 0.5, nli: Any | None = None, run_coverage: bool = True, **_options: Any) citeformer.verify.report.VerificationReport¶
Run NLI-based verification against the cited sources.
Requires the verify extra (pip install citeformer[verify]); the NLI backend is imported lazily on first call.
Args:
    threshold: Entailment probability above which a citation is supported and an uncited sentence is flagged as needing a citation.
    nli: Optional pre-constructed citeformer.verify.NLIModel. If None, the default model (DeBERTa-v3-large-MNLI, or whatever CITEFORMER_NLI_MODEL is set to) is loaded on first use and cached.
    run_coverage: If False, skip the NLI coverage check (per-sentence “should this have been cited?” scan). Useful under REQUIRED policy where the grammar guarantees every sentence has a cite.
Returns:
    A VerificationReport with per-citation entailment scores, an overall support rate, and uncited-but-entailed flags.
Raises:
    ImportError: If citeformer[verify] extras aren’t installed.
    ValueError: If this result was constructed without sources (e.g. a schema_version=1 serialization that predates the current shape).
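How threshold turns entailment scores into a support rate can be sketched without the NLI model. The function below is hypothetical and works on a plain list of scores; it reads "above" as a strict comparison, which is an assumption about behaviour at exactly the threshold, and it is not the VerificationReport implementation.

```python
from typing import Dict, List


def summarize_scores(scores: List[float], threshold: float = 0.5) -> Dict[str, object]:
    """Hypothetical summary: flag each score as supported and compute the support rate."""
    supported = [score > threshold for score in scores]
    # With no citations there is nothing to support; report 0.0 rather than divide by zero.
    rate = sum(supported) / len(supported) if supported else 0.0
    return {"supported": supported, "support_rate": rate}
```

Raising the threshold trades recall for precision: fewer citations count as supported, and under the documented behaviour fewer uncited sentences are flagged as needing a citation.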