
Verifiably-cited LLM text via constrained decoding
citeformer · Python
A bulletproof way to generate verifiably cited text from language models — structurally unforgeable citation markers via constrained decoding.
citeformer makes citation fabrication structurally impossible. Before a language model picks its next token, citeformer compiles a tiny grammar that only admits citation markers pointing at sources you actually supplied, and hands that grammar to the decoder via XGrammar / llguidance / GBNF (locally) or strict structured outputs (across modern API providers). Out-of-scope [N] tokens get masked to zero probability before sampling — the sampler never sees them. Bibliographies are rendered deterministically by the library in six academic styles (APA-7, MLA, Chicago, IEEE, Nature, Vancouver), and every emitted claim can be NLI-verified against its cited source after the fact.
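The masking step described above can be illustrated with a minimal, self-contained sketch. This is not citeformer's internal code, and the function name is hypothetical; it only shows the idea of sending out-of-scope citation-marker logits to negative infinity so the sampler assigns them exactly zero probability:

```python
import math

def mask_citation_logits(logits: dict[str, float], n_sources: int) -> dict[str, float]:
    """Toy illustration (not citeformer's API): send citation markers [k]
    with k > n_sources to -inf, so softmax gives them zero probability
    and the sampler can never emit them."""
    allowed = {f"[{k}]" for k in range(1, n_sources + 1)}
    return {
        tok: (-math.inf
              if tok.startswith("[") and tok.endswith("]") and tok not in allowed
              else logit)
        for tok, logit in logits.items()
    }

# Two sources supplied: [3] becomes token-impossible; ordinary tokens are untouched.
logits = {"[1]": 2.0, "[2]": 1.5, "[3]": 3.0, "the": 0.5}
masked = mask_citation_logits(logits, n_sources=2)
assert masked["[3]"] == -math.inf and masked["the"] == 0.5
```

In the real pipeline this masking is performed by the grammar engine (XGrammar / llguidance) rather than hand-rolled Python, but the contract is the same: the invalid marker is eliminated before sampling, not filtered after.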
LLM-generated citations are wrong 14–95% of the time depending on the benchmark; RAG systems still fabricate 3–13% of cited URLs; NeurIPS 2025 accepted ~50 papers with AI-generated fake references. Prompting doesn't fix it; post-hoc verification doesn't fix it. The only real fix is structural — make the invalid output token-impossible before the model reaches the decision point. citeformer delivers that contract across ten backends (HF, vLLM, llama.cpp, OpenAI, Anthropic, Gemini, Mistral, Fireworks, OpenRouter, Together), proven across a 40-run multi-prompt sweep at 0.0 ± 0.0 fabrication — the std is identically zero because the guarantee is a contract, not a mean.
- Grammar constraint: the cite-id terminal is compiled per call to "[" ("1" | "2" | ... | "N") "]" and handed to XGrammar (default) or llguidance; out-of-scope tokens get masked before sampling.
- Enforcement: GenerationResult — local backends (HF, vLLM, llama.cpp) enforce in-process; API backends (OpenAI, Anthropic, Gemini, Mistral, OpenRouter, Together) enforce inside the provider runtime via strict structured outputs; Fireworks accepts citeformer's GBNF natively, unchanged.
- Bibliographies: citeproc-py dependency; 300 locked snapshots pin formatter outputs.
- Verification: result.verify() — DeBERTa-v3-large-MNLI entailment per (source, cited sentence) pair, returning a typed VerificationReport with a coverage check for uncited-but-entailed sentences.
- Sources: Source.from_doi(...), Source.from_arxiv(...), raw-content Source(metadata=..., content=...); httpx + pypdf + GROBID + readability for fetch and parse.
- Demo: hf-space/ runs the adversarial "100% → 0% fabrication" swing on CPU in a browser.

citeformer slots into any RAG pipeline that emits cited prose: drop it in front of the LLM call, hand it your Source list, and result.text is guaranteed not to contain [N] for N > len(sources). Apache-2.0, Python 3.11+ (tested through 3.14). The literature-review notebook (examples/08_literature_review.ipynb) walks end-to-end from arXiv fetch → grammar-constrained generation → NLI verification → APA-7 bibliography on a laptop-friendly 500 MB model.
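The per-call grammar compilation can be sketched in a few lines. This is an illustrative reconstruction under the GBNF rule quoted above, not citeformer's actual implementation; both function names are hypothetical:

```python
import re

def compile_cite_grammar(n_sources: int) -> str:
    """Illustrative only: build the per-call GBNF cite-id terminal,
    admitting exactly the markers [1] through [n_sources]."""
    alts = " | ".join(f'"{k}"' for k in range(1, n_sources + 1))
    return f'cite-id ::= "[" ({alts}) "]"'

def fabricated_citations(text: str, n_sources: int) -> list[int]:
    """Post-hoc check: list any [N] markers pointing past the source list.
    Under grammar-constrained decoding this is always empty."""
    return [int(m) for m in re.findall(r"\[(\d+)\]", text) if int(m) > n_sources]

grammar = compile_cite_grammar(3)
# grammar == 'cite-id ::= "[" ("1" | "2" | "3") "]"'
assert fabricated_citations("A claim [1], another [4].", 3) == [4]
```

Because the grammar is rebuilt for every call from the Source list you pass in, growing or shrinking the bibliography automatically changes which markers are expressible.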
Every [N] must be traceable and entailment-verifiable.
© 2026 Blaise Albis-Burdige | hello@blaiseab.com