citeformer.verify.sentences

Sentence splitter for verification paths.

Verification needs to identify per-sentence char spans so that:

  1. Each Citation can be associated with “the sentence containing this marker”.

  2. Uncited sentences can be scored against every source for coverage checks.

We avoid heavy NLP dependencies (nltk with its punkt download, spaCy) and emit spans via a small regex-based splitter. It handles the common cases: ASCII and Unicode terminators, repeated terminators (!?, !!), and abbreviations common enough to skip (Dr., et al., e.g., i.e.). It will mis-split on exotic cases such as abbreviated initials in names and URLs containing dots; that is an accepted limitation for v0.1.

Trade-off discussion lives in the verification docs (docs/verification.md#limitations).

Module Contents

Classes

SentenceSpan

One sentence extracted from a text, carrying its char offsets.

Functions

split_sentences

Split text into sentence spans.

sentence_containing

Return the SentenceSpan containing char_offset, or None if not found.

strip_citation_markers

Remove [N] style citation markers from text.

API

class citeformer.verify.sentences.SentenceSpan

One sentence extracted from a text, carrying its char offsets.

Attributes:

  index: 0-indexed position among the sentences in the source text.

  start: Inclusive char offset into the original text.

  end: Exclusive char offset into the original text.

  text: The sentence slice (stripped of leading/trailing whitespace).

index: int


start: int


end: int


text: str

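To make the offset convention concrete, here is a minimal stand-in for the record. The field names and meanings come from the attribute list above; modelling it as a frozen dataclass is an assumption, since this page does not say how the class is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # assumption: the real class may not be a frozen dataclass
class SentenceSpan:
    """Stand-in mirroring the documented fields of SentenceSpan."""

    index: int  # 0-indexed position among the sentences
    start: int  # inclusive char offset into the original text
    end: int    # exclusive char offset into the original text
    text: str   # the sentence slice, stripped of surrounding whitespace


source = "First point. Second point."
span = SentenceSpan(index=1, start=13, end=26, text="Second point.")

# Because start is inclusive and end is exclusive, the offsets slice
# straight back into the original string.
assert source[span.start:span.end] == span.text
```

The half-open `[start, end)` convention matches Python slicing, so no off-by-one adjustment is needed when mapping a span back to the source text.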

citeformer.verify.sentences.split_sentences(text: str) -> list[citeformer.verify.sentences.SentenceSpan]

Split text into sentence spans.

Spans are returned in source order and cover the full text (modulo leading / trailing whitespace). Empty / whitespace-only inputs return an empty list.

Args: text: The text to split.

Returns: A list of SentenceSpan records.
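The following is a rough sketch of how a regex-based splitter with this contract could work; it is not the library's implementation. The terminator pattern, the abbreviation tuple, and the names `split_sentences_sketch` and `_emit` are all hypothetical, and the local `SentenceSpan` is a stand-in for the documented class.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceSpan:  # stand-in for the documented record
    index: int
    start: int
    end: int
    text: str


# Assumed pattern: a run of ASCII terminators followed by whitespace or end
# of text, or a run of CJK terminators (which need no following space).
_TERMINATOR = re.compile(r"[.!?]+(?=\s|$)|[。！？]+")
# Assumed skip list, taken from the abbreviations named in the module docs.
_ABBREVIATIONS = ("Dr.", "et al.", "e.g.", "i.e.")


def _emit(text: str, start: int, end: int, spans: list[SentenceSpan]) -> int:
    """Append text[start:end] as a span (whitespace trimmed); return end."""
    chunk = text[start:end]
    stripped = chunk.strip()
    if stripped:
        begin = start + (len(chunk) - len(chunk.lstrip()))
        spans.append(SentenceSpan(len(spans), begin, begin + len(stripped), stripped))
    return end


def split_sentences_sketch(text: str) -> list[SentenceSpan]:
    spans: list[SentenceSpan] = []
    cursor = 0
    for match in _TERMINATOR.finditer(text):
        if text[: match.end()].endswith(_ABBREVIATIONS):
            continue  # the dot belongs to an abbreviation, keep scanning
        cursor = _emit(text, cursor, match.end(), spans)
    _emit(text, cursor, len(text), spans)  # trailing text without a terminator
    return spans


spans = split_sentences_sketch("Dr. Lee agrees [1]. Is it true?! Yes")
print([(s.start, s.end, s.text) for s in spans])
# [(0, 19, 'Dr. Lee agrees [1].'), (20, 32, 'Is it true?!'), (33, 36, 'Yes')]
```

This sketch shares the limitations described at the top of the module: a sentence that genuinely ends in a listed abbreviation is not split, and initials such as "J. Smith" are split incorrectly.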

citeformer.verify.sentences.sentence_containing(spans: list[citeformer.verify.sentences.SentenceSpan], char_offset: int) -> citeformer.verify.sentences.SentenceSpan | None

Return the SentenceSpan containing char_offset, or None if not found.

Handy for mapping a Citation.span to the sentence it belongs to.
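As an illustration of the lookup, the sketch below does a linear scan using the documented inclusive-start, exclusive-end offsets. The name `sentence_containing_sketch` is hypothetical, the local `SentenceSpan` is a stand-in, and returning None for an offset that falls in the whitespace between sentences is an assumption about the real behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceSpan:  # stand-in for the documented record
    index: int
    start: int
    end: int
    text: str


def sentence_containing_sketch(
    spans: list[SentenceSpan], char_offset: int
) -> SentenceSpan | None:
    """Return the first span with start <= char_offset < end, else None."""
    for span in spans:
        if span.start <= char_offset < span.end:
            return span
    return None


text = "Water boils at 100 C [1]. Ice floats."
spans = [
    SentenceSpan(0, 0, 25, "Water boils at 100 C [1]."),
    SentenceSpan(1, 26, 37, "Ice floats."),
]

# Locate the marker, then ask which sentence it sits in.
marker_offset = text.index("[1]")
hit = sentence_containing_sketch(spans, marker_offset)
assert hit is not None and hit.index == 0
```

Because spans are returned in source order, a binary search over `start` would also work for long documents; the linear scan keeps the sketch short.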

citeformer.verify.sentences.strip_citation_markers(text: str) -> str

Remove [N] style citation markers from text.

Leading spaces before the marker are consumed to avoid leaving double spaces. Punctuation that follows the marker is preserved.
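The described behaviour can be approximated with a single substitution. This is a sketch, not the library's code: the pattern and the name `strip_citation_markers_sketch` are hypothetical, and it assumes that `[N]` means one bracketed integer such as `[3]` or `[12]`.

```python
import re

# Assumed pattern: optional spaces or tabs, then a bracketed integer.
_MARKER = re.compile(r"[ \t]*\[\d+\]")


def strip_citation_markers_sketch(text: str) -> str:
    """Drop [N] markers together with the spaces that precede them."""
    return _MARKER.sub("", text)


cleaned = strip_citation_markers_sketch(
    "Water boils at 100 C [1]. Ice floats [2][3]."
)
print(cleaned)  # Water boils at 100 C. Ice floats.
```

Consuming the whitespace before the marker rather than after it is what lets the following period stay attached to the sentence.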