citeformer.metadata.pdf

PDF metadata + content extraction.

Two extractor backends:

  • "pypdf" (default, zero-dep): reads the PDF-info metadata (/Title, /Author, /CreationDate, …) and per-page text. Fast and always available; quality depends on the producer of the PDF. Academic PDFs often have these fields set — when they don’t, we return what we have and leave gaps for the caller to fill in.

  • "grobid" (optional, pip install citeformer[grobid]): wraps GROBID, an ML-based scientific-paper parser that returns structured TEI-XML with clean author/title/abstract fields and section-level body text. Requires a GROBID server reachable at grobid_url (defaults to http://localhost:8070). The typical dev setup is:

    docker run -p 8070:8070 grobid/grobid:0.8.0
    

    GROBID extraction is ~5-10× slower than pypdf on a first call but produces substantially cleaner output for downstream NLI scoring — see benchmarks/README.md Finding 3 for the quality gap.
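Because the "grobid" backend needs a reachable server, it can help to probe the server before starting a batch of PDFs. The sketch below is not part of citeformer: grobid_is_alive is a hypothetical helper, and it assumes the server exposes GROBID's standard /api/isalive health-check endpoint under the same base URL passed as grobid_url.

```python
import http.client
import urllib.error
import urllib.request


def grobid_is_alive(grobid_url: str = "http://localhost:8070", timeout: float = 2.0) -> bool:
    """Return True if a GROBID server answers its health-check endpoint.

    Hypothetical helper, not a citeformer API. Assumes GROBID's
    /api/isalive endpoint is served under ``grobid_url``.
    """
    url = grobid_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # URLError/OSError: refused connection, DNS failure, timeout, HTTP error status.
        # ValueError: the URL is malformed (for example, it lacks a scheme).
        return False
```

A caller could use the result to choose between extractor="grobid" and extractor="pypdf" up front instead of waiting for the first extraction to fail.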

Source.from_pdf(path, extractor="grobid") forwards to this module; direct callers can call extract_pdf and pass extractor= themselves.

Module Contents

Functions

extract_pdf

Extract CSL-JSON metadata + body text from a PDF.

Data

API

citeformer.metadata.pdf.DEFAULT_GROBID_URL

http://localhost:8070

citeformer.metadata.pdf.extract_pdf(path: str | pathlib.Path, *, extractor: Literal['pypdf', 'grobid'] = 'pypdf', grobid_url: str = DEFAULT_GROBID_URL, **kwargs: Any) -> tuple[dict[str, Any], str]

Extract CSL-JSON metadata + body text from a PDF.

Args:
    path: Filesystem path to the PDF.
    extractor: "pypdf" (default; fast, zero-dep) or "grobid" (ML-based, needs a running GROBID server).
    grobid_url: GROBID server URL. Defaults to http://localhost:8070. Ignored for extractor="pypdf".
    **kwargs: Extractor-specific options (currently timeout on GROBID).

Returns:
    (metadata, content). metadata is a CSL-JSON dict with at least id, type, and title. The author, issued, and abstract (GROBID only) fields are included when extractable. content is the joined body text.
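Since only id, type, and title are guaranteed, code that consumes metadata should tolerate missing author and issued fields. The following is a minimal sketch of that, assuming the standard CSL-JSON shapes (author as a list of name objects with given/family or literal, issued as {"date-parts": [[year, ...]]}). describe_source is a hypothetical helper, and the sample dicts in the usage note are made up for illustration.

```python
from typing import Any


def describe_source(metadata: dict[str, Any]) -> str:
    """Build a short 'Authors (year). Title' line from a CSL-JSON dict.

    Hypothetical helper: tolerates missing author and issued fields.
    """
    names = []
    for author in metadata.get("author", []):
        # Personal names use "given"/"family"; institutions use "literal".
        name = author.get("literal") or " ".join(
            part for part in (author.get("given"), author.get("family")) if part
        )
        if name:
            names.append(name)

    # CSL-JSON dates are nested lists: {"date-parts": [[year, month, day]]}.
    date_parts = metadata.get("issued", {}).get("date-parts", [])
    year = str(date_parts[0][0]) if date_parts and date_parts[0] else "n.d."

    authors = ", ".join(names) if names else "Unknown author"
    return f"{authors} ({year}). {metadata.get('title', '')}"
```

With a fully populated dict this yields a line such as "Ada Lovelace, Example Lab (1843). A Sample Paper"; with only the guaranteed fields it degrades to "Unknown author (n.d.). Untitled".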

Raises:
    FileNotFoundError: If path doesn’t exist.
    ImportError: If extractor="grobid" and the grobid extra isn’t installed.
    RuntimeError: If extractor="grobid" and the server is unreachable / returns non-200.
    ValueError: If extractor isn’t one of the supported values.
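The documented exceptions suggest one way to prefer GROBID while degrading gracefully: catch ImportError and RuntimeError and retry with the pypdf backend. The sketch below is an illustration, not something citeformer provides. extract_with_fallback is a hypothetical wrapper, and the extraction function is passed in as an argument (in real use this would be extract_pdf) so the pattern can be exercised without a PDF or a server.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Callable

# Any callable with the same shape as extract_pdf: returns (metadata, content).
ExtractFn = Callable[..., "tuple[dict[str, Any], str]"]


def extract_with_fallback(
    extract: ExtractFn, path: str | Path, **grobid_options: Any
) -> tuple[dict[str, Any], str, str]:
    """Try the GROBID backend first and fall back to pypdf if it is unavailable.

    Hypothetical wrapper. Returns (metadata, content, extractor_used).
    """
    try:
        metadata, content = extract(path, extractor="grobid", **grobid_options)
        return metadata, content, "grobid"
    except (ImportError, RuntimeError):
        # ImportError: the grobid extra is not installed.
        # RuntimeError: the server is unreachable or returned non-200.
        # FileNotFoundError and ValueError are left to propagate, since
        # retrying with another backend would not fix them.
        metadata, content = extract(path, extractor="pypdf")
        return metadata, content, "pypdf"
```

In real use the call would look like extract_with_fallback(extract_pdf, "paper.pdf", timeout=60), where timeout is the GROBID option mentioned under **kwargs; the value 60 and the path are placeholders.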