citeformer.integrations.langchain

LangChain ↔ citeformer adapter.

LangChain retrievers return a List[Document], where each Document carries page_content (the chunk text) and metadata (a free-form dict). To pass those to Citeformer.generate, each Document must be converted to a Source with CSL-JSON-shaped metadata.

Duck-typed: we don’t import LangChain at module load, so you can use these functions with anything that has page_content: str + metadata: dict attributes — LangChain’s Document, a mock, a pydantic model, whatever.
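To make the duck-typing contract concrete, here is a minimal sketch of an object that satisfies it without importing LangChain. The `DocumentLike` protocol and the `Chunk` dataclass are hypothetical names used for illustration; the module's private `_DocumentLike` type may be defined differently.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DocumentLike(Protocol):
    """Hypothetical stand-in for the module's private _DocumentLike type."""

    page_content: str
    metadata: dict[str, Any]


@dataclass
class Chunk:
    """A plain dataclass with the two attributes the adapter reads."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


chunk = Chunk("Transformers use self-attention.", {"title": "Attention notes"})

# A runtime-checkable protocol only verifies that the attributes exist,
# not their types, which mirrors the duck-typed contract described above.
print(isinstance(chunk, DocumentLike))
print(isinstance("just a string", DocumentLike))
```

Any object shaped like `Chunk` should be acceptable wherever the functions below expect a document.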

Typical usage::

    from citeformer import Citeformer
    from citeformer.integrations.langchain import sources_from_documents
    from citeformer.backends.hf import HFBackend

    # LangChain retriever; newer LangChain releases deprecate
    # get_relevant_documents in favour of retriever.invoke(query).
    docs = retriever.get_relevant_documents(query)
    sources = sources_from_documents(docs)

    cf = Citeformer(backend=HFBackend("gpt2"))
    result = cf.generate(prompt=query, sources=sources)

If your retrieved documents have rich metadata (a Zotero library, a Crossref-backed vectorstore), pass metadata_converter= to map from your custom shape to CSL-JSON. The default converter produces a minimal-but-valid CSL item ({id, type: 'webpage', title}) from whatever is in Document.metadata.
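As an illustration of metadata_converter=, the sketch below maps a made-up, Zotero-style metadata shape to CSL-JSON. The input keys (`item_key`, `creators`, `year`, `doi`) and the function name `zotero_to_csl` are assumptions for this example and are not defined by citeformer; the output field names (`author` with `family`/`given`, `issued` with `date-parts`, `DOI`) follow the CSL-JSON schema.

```python
from typing import Any


def zotero_to_csl(metadata: dict[str, Any]) -> dict[str, Any]:
    """Map a hypothetical Zotero-like metadata dict to a CSL-JSON item."""
    item: dict[str, Any] = {
        "id": str(metadata["item_key"]),
        "type": metadata.get("type", "article-journal"),
        "title": metadata.get("title", ""),
    }
    # CSL-JSON represents people as {"family": ..., "given": ...} objects.
    creators = metadata.get("creators", [])
    if creators:
        item["author"] = [
            {"family": c["last"], "given": c["first"]} for c in creators
        ]
    # Dates are nested lists of integer parts: [[year, month, day]].
    if "year" in metadata:
        item["issued"] = {"date-parts": [[int(metadata["year"])]]}
    if "doi" in metadata:
        item["DOI"] = metadata["doi"]
    return item


example_csl = zotero_to_csl(
    {
        "item_key": "EXAMPLE01",
        "title": "An Example Paper",
        "creators": [{"first": "Ada", "last": "Lovelace"}],
        "year": "2020",
    }
)
print(example_csl)

# With citeformer installed, the converter would be passed as:
# sources = sources_from_documents(docs, metadata_converter=zotero_to_csl)
```

Keeping the converter a pure dict-to-dict function makes it easy to unit-test separately from retrieval and generation.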

Module Contents

Functions

default_metadata_converter

Fallback conversion from LangChain-style metadata to CSL-JSON.

source_from_document

Convert one LangChain-shaped Document into a citeformer Source.

sources_from_documents

Convert an iterable of LangChain documents to citeformer sources.

Data

MetadataConverter

Callable type for metadata converters, (dict) -> dict.

API

citeformer.integrations.langchain.MetadataConverter

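Type of the metadata_converter argument accepted by source_from_document and sources_from_documents: a callable with signature (dict) -> dict that maps a document's metadata to a CSL-JSON item.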

citeformer.integrations.langchain.default_metadata_converter(metadata: dict[str, Any]) -> dict[str, Any]

Fallback conversion from LangChain-style metadata to CSL-JSON.

Pulls common keys the LangChain ecosystem uses (title, source, url, author) and packages them as a minimal CSL-JSON {id, type: 'webpage', title, URL?} item. Unknown keys are kept under _langchain_metadata so downstream code can still access them if needed.
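The behaviour described above could be approximated as follows. This is a hedged reimplementation for illustration, not the library's code: the name `sketch_default_converter`, the fallback order used for `id` and `title`, and the mapping of `author` to a CSL `literal` name are all assumptions, since the description does not specify them.

```python
from typing import Any

_KNOWN_KEYS = {"title", "source", "url", "author"}


def sketch_default_converter(metadata: dict[str, Any]) -> dict[str, Any]:
    """Illustrative fallback: LangChain-style metadata -> minimal CSL-JSON."""
    source = metadata.get("source")
    url = metadata.get("url")
    # Assumption: fall back through source/url to find a title and an id.
    title = metadata.get("title") or source or url or "Untitled"
    item: dict[str, Any] = {
        "id": str(source or url or title),
        "type": "webpage",
        "title": str(title),
    }
    # URL is optional: use "url", or "source" when it looks like a web address.
    if url:
        item["URL"] = str(url)
    elif isinstance(source, str) and source.startswith(("http://", "https://")):
        item["URL"] = source
    # Assumption: a free-text author becomes a CSL "literal" name.
    if metadata.get("author"):
        item["author"] = [{"literal": str(metadata["author"])}]
    # Keep everything else so downstream code can still reach it.
    extras = {k: v for k, v in metadata.items() if k not in _KNOWN_KEYS}
    if extras:
        item["_langchain_metadata"] = extras
    return item


print(sketch_default_converter({"source": "https://example.org/a", "page": 3}))
```

The real function may choose ids and titles differently; the point of the sketch is the overall shape: a small set of recognised keys, a fixed 'webpage' type, and a side channel for everything else.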

citeformer.integrations.langchain.source_from_document(document: citeformer.integrations.langchain._DocumentLike, *, metadata_converter: citeformer.integrations.langchain.MetadataConverter | None = None) -> citeformer.core.Source

Convert one LangChain-shaped Document into a citeformer Source.

Args:
    document: Object with page_content: str and metadata: dict attributes. LangChain’s langchain_core.documents.Document is the canonical shape; any duck-typed equivalent works.
    metadata_converter: Optional override for the default CSL-JSON conversion. Signature: (dict) -> dict. Useful when your retrieved documents come from a rich source (Zotero, a Crossref-backed vectorstore) and you want to preserve that metadata.

Returns: A Source with content = document.page_content and metadata shaped as CSL-JSON.

Raises: TypeError: If document doesn’t have the expected attributes.
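The contract above (attribute check, optional converter, TypeError on a bad input) can be sketched as below. `SketchSource` and `sketch_source_from_document` are hypothetical stand-ins written for this example; the real Source class may have more fields, and the placeholder default converter here is much simpler than default_metadata_converter.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Callable, Optional


@dataclass
class SketchSource:
    """Stand-in for citeformer.core.Source (only the two fields named above)."""

    content: str
    metadata: dict[str, Any]


def sketch_source_from_document(
    document: Any,
    *,
    metadata_converter: Optional[Callable[[dict[str, Any]], dict[str, Any]]] = None,
) -> SketchSource:
    """Illustrative version of the conversion, including the TypeError path."""
    # Duck-typed check: only the two attributes matter, not the class.
    for attr in ("page_content", "metadata"):
        if not hasattr(document, attr):
            raise TypeError(
                f"expected an object with .{attr}, got {type(document).__name__}"
            )
    # Placeholder default; the library falls back to default_metadata_converter.
    convert = metadata_converter or (lambda meta: {"type": "webpage", **meta})
    return SketchSource(
        content=document.page_content,
        metadata=convert(dict(document.metadata)),
    )


doc = SimpleNamespace(page_content="chunk text", metadata={"title": "T"})
print(sketch_source_from_document(doc))

try:
    sketch_source_from_document("not a document")
except TypeError as exc:
    print(f"rejected: {exc}")
```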

citeformer.integrations.langchain.sources_from_documents(documents: collections.abc.Iterable[citeformer.integrations.langchain._DocumentLike], *, metadata_converter: citeformer.integrations.langchain.MetadataConverter | None = None) -> list[citeformer.core.Source]

Convert an iterable of LangChain documents to citeformer sources.

Preserves order; downstream citation ids correspond 1:1 with list position, which matches how LangChain retrievers return their relevance-ordered results.
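Because citation ids follow list position, any filtering is best done before conversion and in a way that keeps the retriever's ranking. Below is a minimal sketch of order-preserving de-duplication; `dedupe_documents` is a hypothetical helper and not part of citeformer.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def dedupe_documents(documents: Iterable[Any]) -> list[Any]:
    """Drop repeated chunks, keeping the first (highest-ranked) occurrence."""
    seen: set[str] = set()
    unique = []
    for doc in documents:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


docs = [
    SimpleNamespace(page_content="alpha", metadata={"title": "A"}),
    SimpleNamespace(page_content="beta", metadata={"title": "B"}),
    SimpleNamespace(page_content="alpha", metadata={"title": "A (copy)"}),
]
ranked = dedupe_documents(docs)
for position, doc in enumerate(ranked):
    print(position, doc.metadata["title"])

# With citeformer installed, conversion would follow the same order:
# sources = sources_from_documents(ranked)
```

Filtering after conversion would shift list positions, so a citation could end up pointing at a different document than the one the retriever ranked at that position.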