citeformer.verify.sentences
Sentence splitter for verification paths.
Verification needs to identify per-sentence char spans so that:
- Each Citation can be associated with “the sentence containing this marker”.
- Uncited sentences can be scored against every source for coverage checks.
We avoid heavy NLP dependencies (nltk with its punkt download, spacy) and emit
spans via a small regex-based splitter. It handles the common cases: ASCII
and Unicode terminators, runs of terminators (!?, !!), and abbreviations
common enough to skip (Dr., et al., e.g., i.e.). It will
mis-split on exotic cases (abbreviated initials in names, URLs with dots);
that’s an accepted limitation for v0.1.
Trade-off discussion lives in the verification docs
(docs/verification.md#limitations).
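To make the approach concrete, here is a minimal sketch of a regex-based splitter in the spirit described above. It is not the module's actual implementation: the names `Span`, `ABBREVIATIONS`, `BOUNDARY`, and `sketch_split` are hypothetical, and the terminator set and abbreviation list are assumptions chosen for illustration.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for a sentence record (not the library class)."""
    index: int
    start: int  # inclusive char offset
    end: int    # exclusive char offset
    text: str


# Assumed abbreviation list, lowercased, each including its trailing dot.
ABBREVIATIONS = ("dr.", "et al.", "e.g.", "i.e.")

# A run of terminators (ASCII plus a few Unicode ones, an assumed set)
# followed by whitespace or the end of the text. Requiring whitespace is a
# simplification that does not suit scripts written without spaces.
BOUNDARY = re.compile(r"[.!?…。！？]+(?=\s|$)")


def sketch_split(text: str) -> list[Span]:
    spans: list[Span] = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.end()]
        # Skip boundaries that merely close a known abbreviation.
        if candidate.lower().endswith(ABBREVIATIONS):
            continue
        _append(spans, text, start, match.end())
        start = match.end()
    # Whatever remains after the last terminator is a final sentence.
    _append(spans, text, start, len(text))
    return spans


def _append(spans: list[Span], text: str, start: int, end: int) -> None:
    chunk = text[start:end]
    stripped = chunk.strip()
    if not stripped:
        return  # whitespace-only chunks produce no span
    lead = len(chunk) - len(chunk.lstrip())
    begin = start + lead
    spans.append(Span(len(spans), begin, begin + len(stripped), stripped))


sample = "Dr. Smith agrees [1]. Really?! Yes, e.g. here."
result = sketch_split(sample)
```

The suffix check is deliberately crude (any chunk ending in `dr.` is treated as an abbreviation), which is the kind of trade-off a dependency-free splitter accepts.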
Module Contents
Classes
SentenceSpan | One sentence extracted from a text, carrying its char offsets.
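Functions
split_sentences | Split text into sentence spans.
sentence_containing | Return the SentenceSpan containing char_offset, or None if not found.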
Remove |
API
- class citeformer.verify.sentences.SentenceSpan
One sentence extracted from a text, carrying its char offsets.
Attributes:
  index: 0-indexed position among the sentences in the source text.
  start: Inclusive char offset into the original text.
  end: Exclusive char offset into the original text.
  text: The sentence slice (stripped of leading/trailing whitespace).
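The offsets form a half-open interval, so a span can be checked against the original text by slicing. The sketch below uses a hypothetical stand-in dataclass, `SpanSketch`, rather than the real class, and it assumes that `start` and `end` delimit the already-stripped sentence; the attribute descriptions suggest this but do not state it outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSketch:
    """Hypothetical stand-in mirroring the documented attributes."""
    index: int
    start: int  # inclusive
    end: int    # exclusive
    text: str


source = "  First one.  Second one. "
spans = [
    SpanSketch(index=0, start=2, end=12, text="First one."),
    SpanSketch(index=1, start=14, end=25, text="Second one."),
]

for span in spans:
    # Assumed invariant: slicing the source with [start:end] yields the text.
    assert source[span.start:span.end] == span.text
    # With an exclusive end, the span length is simply end - start.
    assert span.end - span.start == len(span.text)
```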
- citeformer.verify.sentences.split_sentences(text: str) -> list[citeformer.verify.sentences.SentenceSpan]
Split text into sentence spans.
Spans are returned in source order and cover the full text (modulo leading / trailing whitespace). Empty / whitespace-only inputs return an empty list.
Args: text: The text to split.
Returns: A list of SentenceSpan records.
- citeformer.verify.sentences.sentence_containing(spans: list[citeformer.verify.sentences.SentenceSpan], char_offset: int) -> citeformer.verify.sentences.SentenceSpan | None
Return the SentenceSpan containing char_offset, or None if not found.
Handy for mapping a Citation.span to the sentence it belongs to.
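As an illustration of the lookup, the sketch below finds the span containing an offset using a binary search over span starts. It is a self-contained approximation, not the library's code: `Span` and `containing_sketch` are hypothetical names, the real function may simply scan linearly, and the citation marker is represented here by a plain integer offset because the shape of `Citation.span` is not described in this section.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Span(NamedTuple):
    """Hypothetical stand-in for a sentence record."""
    index: int
    start: int  # inclusive
    end: int    # exclusive
    text: str


def containing_sketch(spans: list[Span], char_offset: int) -> Optional[Span]:
    # Assumes spans are in source order and do not overlap.
    starts = [span.start for span in spans]
    i = bisect_right(starts, char_offset) - 1
    if i >= 0 and spans[i].start <= char_offset < spans[i].end:
        return spans[i]
    return None  # offset falls before, between, or after the spans


text = "Cats purr [1]. Dogs bark."
spans = [
    Span(0, 0, 14, "Cats purr [1]."),
    Span(1, 15, 25, "Dogs bark."),
]

marker_offset = text.index("[1]")  # offset of the citation marker
hit = containing_sketch(spans, marker_offset)
gap = containing_sketch(spans, 14)  # the space between the two sentences
```

Here `hit` is the first span and `gap` is `None`, since the whitespace between sentences belongs to neither span under the half-open convention.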