Stop trusting citations blindly. Verify them.
CiteSage is a citation integrity verification pipeline for submission and peer review workflows.
Starting from a manuscript (PDF or Word), it automatically extracts references, verifies metadata across multiple academic databases, and audits whether each author claim is actually supported by the cited source — with a full, reproducible evidence trail.
If you've ever submitted a paper with a citation that turned out not to exist, or used an AI writing tool that confidently invented a reference — CiteSage is built for that problem.
Tip
Phase 18 Gold Standard: CiteSage v5.0.1 has achieved absolute standardization through a universal audit (Phase 16-17), LanceDB migration (Phase 17+), and a strict Open Source Tool Selection Policy — featuring a unified output architecture, 100% test coverage for core agents, scientific/agentic memory isolation, and a high-performance Agentic Web Portal (Chainlit).
It does two things:
| Mode | What it does |
|---|---|
| Verify | You give it a paper (PDF/Word). It checks every citation: Does this paper exist? Is the metadata correct? Does the cited content actually match what the author claimed? |
| Curate | You describe a research topic in plain language. CiteSage searches global academic databases, downloads real PDFs, and generates a structured evidence report. |
The problem: LLMs hallucinate citations. Researchers copy wrong metadata. Papers cite retracted work. Traditional citation managers don't verify anything — they just store references.
What CiteSage actually does:
- Checks if the cited DOI exists in CrossRef, OpenAlex, or Semantic Scholar
- Detects retracted papers via OpenAlex and CrossRef signals
- Catches GROBID hallucinations (fake DOIs invented by the PDF parser)
- Compares the author's claim against the actual content of the cited paper
- Flags paywall HTML disguised as PDFs before wasting your time
- Blocks predatory publishers (e.g., MDPI) with explicit warnings
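The paywall pre-check mentioned above can be sketched in a few lines. This is an illustrative Python sketch, not CiteSage's actual code; the function name and checks are assumptions. It relies on two facts: a real PDF begins with the `%PDF-` magic bytes, and is normally served with a PDF content type, while paywall pages are HTML.

```python
# Hypothetical sketch of a paywall pre-check: verify both the declared
# content type and the payload's magic bytes before committing to a
# full download. CiteSage's real implementation may differ.
def looks_like_pdf(content_type: str, first_bytes: bytes) -> bool:
    """Return True only when both the header and payload look like a PDF."""
    is_pdf_type = content_type.split(";")[0].strip().lower() in (
        "application/pdf",
        "application/x-pdf",
    )
    return is_pdf_type and first_bytes.startswith(b"%PDF-")
```

A downloader would issue a HEAD request first to read the content type, then fetch only the first few bytes before downloading the rest.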
- Docker & Docker Compose
- 16GB RAM (E5 embedding model runs locally)
```shell
git clone https://github.com/chenweichiang/citesage.git
cd citesage
cp .env.example .env   # Add your API keys (OpenAlex, Semantic Scholar)
docker compose up -d   # Starts GROBID + Neo4j + Ollama in background
docker compose run --rm verifier your_paper.pdf --auto-fix
```

CiteSage will:
- Extract all citations from your PDF
- Look each one up in multiple academic databases
- Flag any that are missing, wrong, or misrepresented
- Auto-repair broken citations using AI agents
- Generate an HTML report in `output/`
```shell
docker compose run --rm verifier --curate "large language models in accessibility design" --count 8
```

CiteSage will:
- Search OpenAlex, Semantic Scholar, arXiv, Europe PMC, and CrossRef simultaneously
- Re-rank results using a Cross-Encoder (ms-marco-MiniLM-L-12-v2)
- Download real PDFs from 7 different sources
- Extract key claims and generate evidence cards
- Package everything into `output/<session>.tar.gz`
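Behind the multi-source search, ranked lists from different retrievers (e.g., dense E5 and BM25) must be merged into one ordering. A minimal sketch of Reciprocal Rank Fusion, the standard formula behind the "3-way RRF fusion" mentioned in the feature list; the function name and the conventional `k = 60` constant are assumptions, not CiteSage's code:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by several retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three retrievers disagree; the document favoured by two of three wins.
fused = rrf_fuse([["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]])
```

RRF needs no score normalization across retrievers, which is why it is a common choice for fusing BM25 scores with vector distances.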
```
output/
└── verify_your_paper_20260224_154000/
    ├── reports/
    │   └── report.md            ← Citation verification results
    ├── diagnostics/
    │   └── doctor_audit.json    ← System health + repair log
    └── research_data/
        └── pdfs/                ← Downloaded PDFs (curate mode)
```
Each citation gets one of:
- ✅ Verified — exists, metadata matches, content aligns
- ⚠️ Warning — exists, but title/author mismatch or uncertain context
- ❌ Failed — not found in any database, DOI is fake, or paper retracted
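How such a verdict might be assigned can be sketched as follows. The thresholds and function names here are hypothetical illustrations, not CiteSage's actual logic:

```python
from difflib import SequenceMatcher

def classify(found: bool, cited_title: str, db_title: str,
             retracted: bool) -> str:
    """Map one citation's lookup results onto verified/warning/failed."""
    if not found or retracted:
        return "failed"        # missing from every database, or retracted
    similarity = SequenceMatcher(
        None, cited_title.lower(), db_title.lower()
    ).ratio()
    if similarity >= 0.9:
        return "verified"      # metadata matches the database record
    return "warning"           # exists, but the metadata disagrees
```

In the real pipeline the content-alignment check (does the claim match the paper?) would feed into this verdict as well; this sketch covers only the metadata side.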
```
Your PDF
│
├─ Layout-aware parsing (IBM Docling)
│   └─ Isolates the References section accurately
│
├─ GROBID citation extraction
│   └─ Hallucination filter: fake DOIs discarded immediately
│
├─ Multi-source lookup (CrossRef + OpenAlex + Semantic Scholar)
│   └─ Retraction shield + predatory publisher filter
│
├─ Semantic content matching (E5 + LLM dual analysis)
│   └─ "Did the author actually use this paper correctly?"
│
└─ Auto-repair (LangChain ReAct Agent)
    ├─ Tier 1: CrossRef / OpenAlex / Semantic Scholar API lookup (authoritative)
    ├─ Tier 2: Web search → candidate generation only (not used as ground truth)
    └─ Result must resolve to a verifiable DOI or OpenAlex Work ID to be accepted
```
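The hallucination-filter stage can be approximated by a purely syntactic DOI check that rejects impossible strings before any network lookup. A minimal sketch; the regex follows CrossRef's published guidance on modern DOI shape (prefix `10.`, a 4–9 digit registrant code, a slash, then a suffix), but the names are assumptions, not CiteSage's code:

```python
import re

# Modern CrossRef DOIs look like 10.NNNN/suffix; anything else is
# discarded up front, before spending an API call on it.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$", re.IGNORECASE)

def plausible_doi(doi: str) -> bool:
    """Reject strings that cannot be DOIs, without any network lookup."""
    return bool(DOI_RE.match(doi.strip()))
```

Strings that pass this filter still need to be resolved against CrossRef or OpenAlex; the regex only removes obvious parser inventions cheaply.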
- Paragraph-level verification (`--paragraph`) — checks each in-text citation against its reference
- Blind review mode (`--role reviewer`) — anonymizes all provenance data for double-blind compliance
- Deterministic runs (`--deterministic`) — locks the random seed for reproducible results
- 5-source parallel search — OpenAlex + Semantic Scholar + arXiv + Europe PMC + CrossRef
- 7-source PDF acquisition — Unpaywall → S2 → CrossRef → WebSearch → CORE → DOAJ → BASE
- Cloudflare bypass — `cloudscraper` handles publisher bot protection
- Paywall detection — HEAD pre-check blocks HTML paywall pages before downloading
- RAG-powered retrieval — E5 (1024d) + BM25 + 3-way RRF fusion (LanceDB)
- Graph memory — Neo4j stores paper–author–institution relationships for citation network analysis
- Self-healing — API failures trigger exponential backoff (2/5/10s); Circuit Breaker pauses after 5 consecutive failures
- Session encapsulation — every run packaged into `output/<session>.tar.gz` for reproducibility
- Lightning-fast builds — `uv` package manager for sub-second dependency resolution
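The self-healing behaviour above (2/5/10s exponential backoff plus a circuit breaker that opens after five consecutive failures) can be sketched as follows. Illustrative only; class and function names are assumptions, not CiteSage's implementation:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; a success resets it."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1

def call_with_backoff(fn, breaker: CircuitBreaker, delays=(2, 5, 10)):
    """Try fn once per delay step, sleeping between attempts."""
    for attempt, delay in enumerate(delays):
        if breaker.open:
            raise RuntimeError("circuit open: upstream API paused")
        try:
            result = fn()
            breaker.record(success=True)
            return result
        except Exception:
            breaker.record(success=False)
            if attempt == len(delays) - 1:
                raise       # backoff exhausted; propagate the real error
            time.sleep(delay)
```

The breaker is shared across calls to one upstream API, so a flapping service is paused globally instead of being hammered by every citation lookup.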
Copy `.env.example` to `.env` and fill in:

```shell
OPENALEX_EMAIL=your@email.com       # Polite pool (faster rate limits)
SEMANTIC_SCHOLAR_API_KEY=your_key   # Required for high-volume searches
CORE_API_KEY=your_key               # Optional: CORE open access
```

```shell
# Full test suite (277 tests)
docker compose run --rm --entrypoint "" verifier pytest tests/ -q

# Prompt regression (7/7 passing) — run after changing Scout/Synthesizer prompts
npx promptfoo@latest eval --config eval/prompts/scout_intent.yaml
npx promptfoo@latest eval --config eval/prompts/synthesizer.yaml

# RAG system validation
docker compose run --rm --entrypoint "" verifier python eval/rag/unified_rag_test.py
```

| Phase | What was built | Status |
|---|---|---|
| 1–10 | Core verification engine, GROBID, multi-format support, Docker CI | ✅ Done |
| 11 | Hybrid RAG (BM25+Vector RRF), Section-Aware Chunking, Cross-Encoder Reranker | ✅ Done |
| 12 | Grounded generation (sentence-level attribution), Ragas CI gate, Cloudflare e2e | ✅ Done |
| 13 | 5-source search, E5 (1024d), 7-source PDF chain, API infra (KeyRotator + CircuitBreaker) | ✅ Done |
| 14 | Architecture refactoring (CLI/pipeline split, RepairAgent, EmbeddingManager), Docker hardening | ✅ Done |
| 15 | Agentic Web Portal (Chainlit), Temporal Alignment sorting, Evidence Cards visualization | ✅ Done |
| 16 | Output Path Standardization, Log consolidation, metadata_resolver stability audit | ✅ Done |
| 17 | Universal System Audit, Gold Standard normalization, Dual-Push synchronized deployment | ✅ Done |
| 17+ | LanceDB Migration, Shared Memory Integration, Memory Isolation Rules | ✅ Done |
| 18 | Production scaling, advanced analytics | 🔵 Started |
If you use CiteSage in research, this preferred citation is appreciated (not a legal requirement under MIT):
```bibtex
@software{citesage2026,
  author = {Chiang, Chenwei},
  title  = {CiteSage: An Agentic Verification Framework for Academic Integrity},
  year   = {2026},
  url    = {https://github.com/chenweichiang/citesage},
  note   = {v5.0.1, Phase 18}
}
```

MIT License. See LICENSE.
CiteSage is an academic research artifact. For notes on research priority and collaboration, see RESEARCH-NOTICE.md. For the full development history and roadmap, see SYSTEM_PLAN.md.
