citeformer.metadata.url
URL metadata + content extraction.
readability-lxml finds the article body; raw lxml pulls OpenGraph /
Twitter / article meta tags for title / author / date / site-name. The
returned CSL-JSON uses type: "webpage" — users who know the content is
actually an article or paper should override after calling from_url.
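To illustrate the meta-tag half of this pipeline, here is a minimal sketch of how OpenGraph / Twitter / article tags might be mapped to CSL-JSON fields. It is not the module's implementation: it uses the standard library's `html.parser` in place of lxml so that it runs without extra dependencies. `MetaTagCollector` and `sketch_metadata` are hypothetical names, and the choice of tags, their precedence, and the use of the URL as `id` are all assumptions.

```python
from __future__ import annotations

from html.parser import HTMLParser
from typing import Any


class MetaTagCollector(HTMLParser):
    """Collect <meta property=...> / <meta name=...> values; the first value wins."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: dict[str, str] = {}

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        value = attr.get("content")
        if key and value:
            self.tags.setdefault(key.lower(), value.strip())


def sketch_metadata(html: str, url: str) -> dict[str, Any]:
    """Build a CSL-JSON-like dict from meta tags (hypothetical helper)."""
    collector = MetaTagCollector()
    collector.feed(html)
    tags = collector.tags

    metadata: dict[str, Any] = {
        "id": url,  # assumption: the real module may derive the id differently
        "type": "webpage",
        "URL": url,
        # Title falls back to the URL when no tag provides one.
        "title": tags.get("og:title") or tags.get("twitter:title") or url,
    }
    # Optional fields are only added when a tag supplies them.
    if "og:site_name" in tags:
        metadata["container-title"] = tags["og:site_name"]
    if "article:author" in tags:
        metadata["author"] = [{"literal": tags["article:author"]}]
    if "article:published_time" in tags:
        metadata["issued"] = {"raw": tags["article:published_time"]}
    return metadata
```

The sketch mirrors the documented behaviour only in outline: required keys are always present, and optional keys appear only when the page declares them.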
Module Contents
Functions
extract_url | Fetch a URL and extract CSL-JSON metadata + article text.
API
- citeformer.metadata.url.extract_url(url: str, *, timeout: float = _DEFAULT_TIMEOUT) → tuple[dict[str, Any], str]
Fetch a URL and extract CSL-JSON metadata + article text.
Args:
url: HTTP(S) URL.
timeout: HTTP timeout in seconds.
Returns:
(metadata, content). metadata always includes id, type: "webpage", URL, and title (falls back to the URL). author, issued, and container-title are included when meta tags (OpenGraph / Twitter / article:*) provide them. content is the plain-text form of the readability-extracted article body.
Raises:
httpx.HTTPStatusError: On a non-2xx HTTP response.
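A short usage sketch, assuming citeformer is installed: it unpacks the returned tuple and overrides the default type when the caller knows the page is really an article. `retype` and `fetch_article` are hypothetical helpers, not part of the module, and the 10-second timeout is an arbitrary illustrative value. `"article-newspaper"` is one of the standard CSL item types.

```python
from __future__ import annotations

from typing import Any


def retype(metadata: dict[str, Any], csl_type: str) -> dict[str, Any]:
    """Return a copy of the metadata with its CSL type replaced (hypothetical helper)."""
    updated = dict(metadata)
    updated["type"] = csl_type
    return updated


def fetch_article(url: str, timeout: float = 10.0) -> tuple[dict[str, Any], str]:
    """Fetch a page and mark it as a newspaper article (hypothetical helper)."""
    # Imported lazily so retype() stays usable without citeformer installed.
    from citeformer.metadata.url import extract_url

    # httpx.HTTPStatusError propagates to the caller on a non-2xx response.
    metadata, content = extract_url(url, timeout=timeout)
    return retype(metadata, "article-newspaper"), content
```

Copying the dict before changing `type` leaves the original metadata untouched, which can be convenient when the same result is reused elsewhere.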