citeformer.metadata.pdf
PDF metadata + content extraction.
Two extractor backends:
- "pypdf" (default, zero-dep): reads the PDF-info metadata (/Title, /Author, /CreationDate, …) and per-page text. Fast and always available; quality depends on the producer of the PDF. Academic PDFs often have these fields set; when they don't, we return what we have and leave gaps for the caller to fill in.
- "grobid" (optional, pip install citeformer[grobid]): wraps GROBID, an ML-based scientific-paper parser that returns structured TEI-XML with clean author/title/abstract fields and section-level body text. Requires a GROBID server reachable at grobid_url (defaults to http://localhost:8070). The typical dev setup is:

    docker run -p 8070:8070 grobid/grobid:0.8.0
GROBID extraction is ~5-10× slower than pypdf on a first call but produces substantially cleaner output for downstream NLI scoring; see benchmarks/README.md, Finding 3, for the quality gap.
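Because the "grobid" backend depends on a running server, it can be useful to probe the server before choosing that backend. The sketch below is not part of citeformer; `grobid_is_alive` is a hypothetical helper that queries GROBID's `/api/isalive` health-check endpoint using only the standard library, and the 2-second timeout is an arbitrary assumption.

```python
import http.client
import urllib.error
import urllib.request


def grobid_is_alive(grobid_url: str = "http://localhost:8070", timeout: float = 2.0) -> bool:
    """Return True if a GROBID server answers its health check with HTTP 200.

    Hypothetical helper for illustration; not a citeformer function.
    """
    # GROBID exposes a liveness probe at /api/isalive.
    probe = grobid_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(probe, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, DNS failure, bad response, or malformed URL.
        return False
```

A caller might use the result to decide between extractor="grobid" and extractor="pypdf" up front instead of waiting for an error.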
Source.from_pdf(path, extractor="grobid") forwards to this module;
direct callers can use extract_pdf and pass the extractor= keyword themselves.
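To make the PDF-info fields mentioned for the "pypdf" backend more concrete, here is a minimal sketch of how such entries could be mapped to a CSL-JSON dict. It is an illustration under stated assumptions, not citeformer's actual mapping: `info_to_csl` is a hypothetical name, the default type "article" is assumed, and the author field is treated as one literal name.

```python
import re
from typing import Any


def info_to_csl(info: dict[str, str], fallback_id: str) -> dict[str, Any]:
    """Map PDF-info entries to a minimal CSL-JSON dict (illustrative only)."""
    csl: dict[str, Any] = {
        "id": fallback_id,
        "type": "article",  # assumed default; the real module may choose differently
        "title": info.get("/Title", "").strip(),
    }
    author = info.get("/Author", "").strip()
    if author:
        # Keep the raw string as one literal name; producers format this field inconsistently.
        csl["author"] = [{"literal": author}]
    # PDF dates look like D:YYYYMMDDHHmmSS...; everything after the year is optional.
    match = re.match(r"D:(\d{4})(\d{2})?(\d{2})?", info.get("/CreationDate", ""))
    if match:
        parts = [int(group) for group in match.groups() if group is not None]
        csl["issued"] = {"date-parts": [parts]}
    return csl
```

Missing fields are simply left out of the result, which mirrors the "leave gaps for the caller to fill in" behaviour described above.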
Module Contents
Functions
extract_pdf | Extract CSL-JSON metadata + body text from a PDF.
Data
DEFAULT_GROBID_URL
API
- citeformer.metadata.pdf.DEFAULT_GROBID_URL
- citeformer.metadata.pdf.extract_pdf(path: str | pathlib.Path, *, extractor: Literal['pypdf', 'grobid'] = 'pypdf', grobid_url: str = DEFAULT_GROBID_URL, **kwargs: Any) -> tuple[dict[str, Any], str]
Extract CSL-JSON metadata + body text from a PDF.
Args:
    path: Filesystem path to the PDF.
    extractor: "pypdf" (default; fast, zero-dep) or "grobid" (ML-based, needs a running GROBID server).
    grobid_url: GROBID server URL. Defaults to http://localhost:8070. Ignored for extractor="pypdf".
    **kwargs: Extractor-specific options (currently timeout on GROBID).

Returns:
    (metadata, content). metadata is a CSL-JSON dict with at least id, type, and title. The fields author, issued, and abstract (GROBID only) are included when extractable. content is the joined body text.

Raises:
    FileNotFoundError: If path doesn't exist.
    ImportError: If extractor="grobid" and the grobid extra isn't installed.
    RuntimeError: If extractor="grobid" and the server is unreachable / returns non-200.
    ValueError: If extractor isn't one of the supported values.
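The documented exceptions suggest one possible calling pattern: try GROBID first and fall back to pypdf when the extra is missing or the server cannot be reached. The sketch below is an assumption about how a caller might do this, not behaviour provided by citeformer. `extract_with_fallback` is a hypothetical helper, and it takes the extraction function as a parameter so the block stays self-contained.

```python
from pathlib import Path
from typing import Any, Callable, Union

# Any callable with extract_pdf's documented shape: (path, *, extractor=...) -> (metadata, content).
Extractor = Callable[..., tuple[dict[str, Any], str]]


def extract_with_fallback(
    extract: Extractor, path: Union[str, Path]
) -> tuple[dict[str, Any], str, str]:
    """Try the "grobid" backend, fall back to "pypdf"; also report which backend ran.

    Hypothetical helper for illustration; not a citeformer function.
    """
    try:
        metadata, content = extract(path, extractor="grobid")
        return metadata, content, "grobid"
    except (ImportError, RuntimeError):
        # The grobid extra isn't installed, or the server is unreachable / returned non-200.
        metadata, content = extract(path, extractor="pypdf")
        return metadata, content, "pypdf"
```

With citeformer installed, the call would presumably look like `extract_with_fallback(extract_pdf, "paper.pdf")` after `from citeformer.metadata.pdf import extract_pdf`. FileNotFoundError and ValueError are deliberately not caught, since retrying with another backend would not fix a missing file or an invalid extractor name.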