citeformer.verify.nli

Natural-language-inference backend for verification.

We wrap a DeBERTa-v3 MNLI model via transformers. The model is lazy-loaded on the first entail() call and cached globally per (model_name, device), so multiple Verifier instances share weights. Batched scoring is the common path: single-pair calls funnel through the batched API with a one-element batch.

Default model: MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli (~850 MB; well-tested on scientific claims). Override via the nli_model kwarg on Verifier. A smaller / faster default can be swapped in at build time by setting the CITEFORMER_NLI_MODEL env var.

Long premises (>512 tokens) can be chunked (opt-in): we slide a fixed-size window over the premise, score each chunk against the hypothesis, and take the maximum entailment as the pair’s result. That surfaces claim-to-source entailment that lives past the first 512 tokens, which is useful when scoring against full PDF body text. But max-over-windows also inflates false positives on unrelated claims (each extra window is another chance for noise to cross the threshold), so chunking is off by default for score stability; enable it explicitly via chunk_premise=True when the caller wants long-document scoring. When chunking is on, consider raising threshold on the Verifier (0.7–0.8 rather than 0.5) to compensate for the max-reduction bias.
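The windowing and max-reduction described above can be sketched in plain Python. This is an illustration under stated assumptions, not the module's implementation: window_spans, max_entailment, and false_positive_rate are hypothetical helpers, the spans are token offsets into an already tokenized premise, and the false-positive estimate assumes each window crosses the threshold independently, which real overlapping windows do not strictly satisfy.

```python
def window_spans(n_tokens, window=400, stride=300):
    """Return (start, end) token spans that cover a premise of n_tokens."""
    if not 0 < stride < window:
        raise ValueError("stride must be positive and smaller than window")
    if n_tokens <= window:
        # Short premises need a single window.
        return [(0, n_tokens)]
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


def max_entailment(chunk_scores):
    """Reduce per-window entailment probabilities to one pair-level score."""
    return max(chunk_scores)


def false_positive_rate(per_window_rate, n_windows):
    """Chance that at least one window crosses the threshold by noise.

    Assumes independent windows; treat the result as a rough upper-level
    intuition, not a measured rate.
    """
    return 1.0 - (1.0 - per_window_rate) ** n_windows
```

With the documented defaults (window 400, stride 300), a 1000-token premise yields three windows, and under the independence assumption a 5% per-window false-positive rate grows to roughly 14% for the pair. That growth is the reason for raising the threshold when chunking is enabled.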

Requires the verify extra: pip install citeformer[verify].

Module Contents

Classes

NLIResult

One NLI scoring outcome for a (premise, hypothesis) pair.

NLIModel

DeBERTa-v3-MNLI (or drop-in compatible) NLI scorer.

Data

API

citeformer.verify.nli.DEFAULT_NLI_MODEL

‘get(…)’

class citeformer.verify.nli.NLIResult

One NLI scoring outcome for a (premise, hypothesis) pair.

Attributes:
    entailment: Probability of the entailment class in [0, 1].
    neutral: Probability of the neutral class.
    contradiction: Probability of the contradiction class.

entailment: float

neutral: float

contradiction: float

property supports: bool

True if the entailment class is the predicted label.

Equivalent to entailment > max(neutral, contradiction). Call sites that need a hard cutoff should not rely on this property; they should compare entailment to a configured threshold directly.
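The difference between the predicted-label check and a hard cutoff can be sketched with a small stand-in class. NLIResultSketch and passes_threshold are hypothetical names used only for illustration, and the choice of >= for the cutoff is an assumption rather than the Verifier's documented comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NLIResultSketch:
    """Stand-in mirroring the three documented probability fields."""

    entailment: float
    neutral: float
    contradiction: float

    @property
    def supports(self) -> bool:
        # True when entailment is the single most probable class.
        return self.entailment > max(self.neutral, self.contradiction)


def passes_threshold(result, threshold=0.5):
    """Hard cutoff on the entailment probability (>= is an assumption)."""
    return result.entailment >= threshold
```

A result such as NLIResultSketch(0.45, 0.35, 0.20) has supports == True because entailment is the top class, yet it fails a 0.5 threshold. The two checks answer different questions, so pick one deliberately.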

class citeformer.verify.nli.NLIModel(model_name: str = DEFAULT_NLI_MODEL, *, device: str | None = None, batch_size: int = 8, chunk_premise: bool = False, max_premise_tokens: int = _DEFAULT_MAX_PREMISE_TOKENS, chunk_stride: int = _DEFAULT_CHUNK_STRIDE)

DeBERTa-v3-MNLI (or drop-in compatible) NLI scorer.

Instances are cheap to construct; weights are loaded on the first entail() call. The transformers model and tokenizer are cached globally per (model_name, device) via functools.lru_cache, so multiple NLIModel instances with the same model_name and device share one copy of the weights in memory.
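The sharing behaviour can be sketched with functools.lru_cache and a dummy loader. load_shared is a hypothetical function; the real loader would call into transformers, which is replaced here by a plain dictionary so the example runs without any model download.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_shared(model_name, device):
    """Stand-in for an expensive model + tokenizer load.

    The cache key is the (model_name, device) argument tuple, so repeated
    calls with the same arguments return the very same object.
    """
    return {"model_name": model_name, "device": device}
```

Two callers that ask for the same (model_name, device) receive the identical object, while a different device produces a separate cache entry and therefore a separate copy.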

Attributes:
    model_name: HuggingFace model identifier.
    device: Torch device (cuda / mps / cpu) resolved at construction.
    batch_size: Max pairs to score in a single forward pass.
    chunk_premise: When True, long premises are split into overlapping windows; max entailment across windows is the pair’s result. Default is False, because max-over-windows inflates false positives on unrelated claims. Enable for long-document scoring with a bumped threshold on the Verifier (0.7+) to compensate.
    max_premise_tokens: Window size in tokens. Default 400 (leaves room for the hypothesis + special tokens inside DeBERTa’s 512 cap).
    chunk_stride: Token stride between windows. Default 300; overlap = max_premise_tokens - chunk_stride.

Initialization

Construct an NLIModel.

Args:
    model_name: HF identifier (e.g. "MoritzLaurer/DeBERTa-…").
    device: None auto-detects CUDA > MPS > CPU.
    batch_size: Max pairs per forward pass; adjust down on low-VRAM hardware.
    chunk_premise: If True, long premises are chunked and scored with max-entailment reduction. If False (default), raw truncation at max_premise_tokens + hypothesis is used.
    max_premise_tokens: Window size when chunking. 400 is a safe default under DeBERTa’s 512-token limit.
    chunk_stride: Stride between windows. Lower = more overlap = slower but more thorough.

Raises:
    ImportError: If citeformer[verify] extras aren’t installed.
    ValueError: If chunk_stride >= max_premise_tokens (would make windows non-overlapping or skip content).
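The ValueError condition and the resulting overlap can be sketched as a standalone check. check_chunk_config is a hypothetical helper, not part of the module; the defaults of 400 and 300 are the documented values for max_premise_tokens and chunk_stride.

```python
def check_chunk_config(max_premise_tokens=400, chunk_stride=300):
    """Validate the window settings and return the overlap in tokens."""
    if chunk_stride >= max_premise_tokens:
        # Equal values give zero overlap; a larger stride skips tokens.
        raise ValueError(
            "chunk_stride must be smaller than max_premise_tokens "
            f"(got stride={chunk_stride}, window={max_premise_tokens})"
        )
    return max_premise_tokens - chunk_stride
```

With the defaults, adjacent windows share 100 tokens, so a sentence that straddles a window boundary is still seen whole by at least one window as long as it is shorter than the overlap.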

model_name: str

device: str

batch_size: int

chunk_premise: bool

max_premise_tokens: int

chunk_stride: int

entail(premise: str, hypothesis: str) -> citeformer.verify.nli.NLIResult

Score a single (premise, hypothesis) pair.

Uses chunked scoring when chunk_premise is enabled and the premise is long enough to benefit.

Args:
    premise: The evidence / source text.
    hypothesis: The claim being checked against the premise.

Returns: An NLIResult with per-class probabilities.
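Per-class probabilities of this kind are typically obtained by applying a softmax to the classifier's three output logits. The sketch below shows that step in plain Python; softmax and to_class_probs are hypothetical helpers, and the label order is an assumption, since each checkpoint defines its own index-to-label mapping in its config.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_class_probs(logits, labels=("entailment", "neutral", "contradiction")):
    """Map raw logits to named probabilities.

    The default label order is illustrative only; read the real order from
    the model config rather than hard-coding it.
    """
    return dict(zip(labels, softmax(logits)))
```

The three values always sum to 1, which is why a high entailment probability necessarily implies low neutral and contradiction probabilities.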

entail_batch(pairs: list[tuple[str, str]]) -> list[citeformer.verify.nli.NLIResult]

Score a list of (premise, hypothesis) pairs in batches.

Empty input returns an empty list. Uses chunked scoring when the model’s chunk_premise is True; otherwise falls back to the naive 512-token truncation path.

Args: pairs: A list of (premise, hypothesis) tuples.

Returns: Results in the same order as input.
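The batching contract (order preserved, empty input allowed, single pairs routed through the batch path) can be sketched with an injected scoring function. score_in_batches and score_one are hypothetical helpers, and score_batch stands in for the model's forward pass, so the example runs without transformers installed.

```python
def score_in_batches(pairs, score_batch, batch_size=8):
    """Score pairs in slices of at most batch_size, preserving input order."""
    results = []
    for start in range(0, len(pairs), batch_size):
        # score_batch must return one result per pair, in the same order.
        results.extend(score_batch(pairs[start:start + batch_size]))
    return results


def score_one(premise, hypothesis, score_batch):
    """Single-pair convenience wrapper: a one-element batch."""
    return score_in_batches([(premise, hypothesis)], score_batch)[0]
```

Routing the single-pair call through the batched function keeps one code path for tokenization and scoring, at the cost of a trivial list allocation per call.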