citeformer.metadata.url

URL metadata + content extraction.

readability-lxml finds the article body; raw lxml pulls OpenGraph / Twitter / article meta tags for title / author / date / site-name. The returned CSL-JSON uses type: "webpage" — users who know the content is actually an article or paper should override after calling from_url.
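The override mentioned above amounts to replacing one key in the returned metadata dict. A minimal sketch, assuming a hand-written dict as a stand-in for the real return value; `override_type` is a hypothetical helper, not part of citeformer:

```python
from typing import Any


def override_type(metadata: dict[str, Any], csl_type: str) -> dict[str, Any]:
    """Return a copy of the metadata with a different CSL type (hypothetical helper)."""
    updated = dict(metadata)  # shallow copy so the original dict is left untouched
    updated["type"] = csl_type
    return updated


# Stand-in for the metadata half of the returned tuple (values are made up).
metadata = {
    "id": "example-post",
    "type": "webpage",
    "URL": "https://example.org/post",
    "title": "An Example Post",
}

article = override_type(metadata, "article-journal")
print(article["type"])  # article-journal
```

Copying before overriding keeps the extracted record available in case the original type is still needed.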

Module Contents

Functions

extract_url

Fetch a URL and extract CSL-JSON metadata + article text.

API

citeformer.metadata.url.extract_url(url: str, *, timeout: float = _DEFAULT_TIMEOUT) -> tuple[dict[str, Any], str]

Fetch a URL and extract CSL-JSON metadata + article text.

Args:
    url: HTTP(S) URL.
    timeout: HTTP timeout in seconds.

Returns: (metadata, content). metadata always includes id, type: "webpage", URL, and title (falls back to the URL). author, issued, and container-title are included when meta tags (OpenGraph / Twitter / article:*) provide them. content is the plain-text form of the readability-extracted article body.

Raises: httpx.HTTPStatusError: On a non-2xx HTTP response.
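The metadata shape described under Returns can be illustrated with a small stand-alone sketch. It is not the module's implementation: it swaps lxml for the standard library's `html.parser` so it runs without dependencies, and the names `MetaTagCollector` and `build_metadata`, the choice of the URL as `id`, and the exact meta tags consulted are all assumptions made for illustration.

```python
from datetime import date
from html.parser import HTMLParser
from typing import Any


class MetaTagCollector(HTMLParser):
    """Collect content values of <meta property=...> and <meta name=...> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: dict[str, str] = {}

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        if key and content:
            self.tags.setdefault(key, content)  # first occurrence wins


def build_metadata(url: str, html: str) -> dict[str, Any]:
    """Build a CSL-JSON-like dict; optional keys appear only when tags provide them."""
    collector = MetaTagCollector()
    collector.feed(html)
    tags = collector.tags

    metadata: dict[str, Any] = {
        "id": url,  # assumption: the real id scheme is not documented here
        "type": "webpage",
        "URL": url,
        # Title falls back to the URL when no meta tag supplies one.
        "title": tags.get("og:title") or tags.get("twitter:title") or url,
    }
    if site := tags.get("og:site_name"):
        metadata["container-title"] = site
    if author := tags.get("article:author"):
        metadata["author"] = [{"literal": author}]
    if published := tags.get("article:published_time"):
        try:
            day = date.fromisoformat(published[:10])
            metadata["issued"] = {"date-parts": [[day.year, day.month, day.day]]}
        except ValueError:
            pass  # unparseable date: leave "issued" out
    return metadata


page = (
    '<html><head>'
    '<meta property="og:title" content="Hello">'
    '<meta property="og:site_name" content="Example Blog">'
    '<meta property="article:published_time" content="2024-03-05T10:00:00Z">'
    '</head><body></body></html>'
)
meta = build_metadata("https://example.org/hello", page)
bare = build_metadata("https://example.org/empty", "<html></html>")
```

Here `meta` carries `title`, `container-title`, and `issued` but no `author`, while `bare` contains only the four always-present keys, with `title` equal to the URL.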