Stop trusting citations blindly. Verify them.
CiteSage is a citation integrity verification pipeline for submission and peer review workflows.
Starting from a manuscript (PDF or Word), it automatically extracts references, verifies metadata across multiple academic databases, and audits whether each author claim is actually supported by the cited source — with a full, reproducible evidence trail.
If you've ever submitted a paper with a citation that turned out not to exist, or used an AI writing tool that confidently invented a reference — CiteSage is built for that problem.
Tip
Phase 18 Gold Standard: CiteSage v5.0.1 has achieved absolute standardization through a universal audit (Phase 16-17), LanceDB migration (Phase 17+), and a strict Open Source Tool Selection Policy — featuring a unified output architecture, 100% test coverage for core agents, scientific/agentic memory isolation, and a high-performance Agentic Web Portal (Chainlit).
It does two things:
| Mode | What it does |
|---|---|
| Verify | You give it a paper (PDF/Word). It checks every citation: Does this paper exist? Is the metadata correct? Does the cited content actually match what the author claimed? |
| Curate | You describe a research topic in plain language. CiteSage searches global academic databases, downloads real PDFs, and generates a structured evidence report. |
The problem: LLMs hallucinate citations. Researchers copy wrong metadata. Papers cite retracted work. Traditional citation managers don't verify anything — they just store references.
What CiteSage actually does:
- Checks if the cited DOI exists in CrossRef, OpenAlex, or Semantic Scholar
- Detects retracted papers via OpenAlex and CrossRef signals
- Catches GROBID hallucinations (fake DOIs invented by the PDF parser)
- Compares the author's claim against the actual content of the cited paper
- Flags paywall HTML disguised as PDFs before wasting your time
- Blocks predatory publishers (e.g., MDPI) with explicit warnings
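The paywall pre-check mentioned above can be sketched in a few lines. This is an illustrative Python sketch, not CiteSage's actual code; the function name and checks are assumptions. It relies on two facts: a real PDF begins with the `%PDF-` magic bytes, and is normally served with a PDF content type, while paywall pages are HTML.

```python
# Hypothetical sketch of a paywall pre-check: verify both the declared
# content type and the payload's magic bytes before committing to a
# full download. CiteSage's real implementation may differ.
def looks_like_pdf(content_type: str, first_bytes: bytes) -> bool:
    """Return True only when both the header and payload look like a PDF."""
    is_pdf_type = content_type.split(";")[0].strip().lower() in (
        "application/pdf",
        "application/x-pdf",
    )
    return is_pdf_type and first_bytes.startswith(b"%PDF-")
```

A downloader would issue a HEAD request first to read the content type, then fetch only the first few bytes before downloading the rest.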
- Docker & Docker Compose
- 16GB RAM (E5 embedding model runs locally)
```shell
git clone https://github.com/chenweichiang/citesage.git
cd citesage
cp .env.example .env   # Add your API keys (OpenAlex, Semantic Scholar)
docker compose up -d   # Starts GROBID + Neo4j + Ollama in background
docker compose run --rm verifier your_paper.pdf --auto-fix
```

CiteSage will:
- Extract all citations from your PDF
- Look each one up in multiple academic databases
- Flag any that are missing, wrong, or misrepresented
- Auto-repair broken citations using AI agents
- Generate an HTML report in `output/`
```shell
docker compose run --rm verifier --curate "large language models in accessibility design" --count 8
```

CiteSage will:
- Search OpenAlex, Semantic Scholar, arXiv, Europe PMC, and CrossRef simultaneously
- Re-rank results using a Cross-Encoder (ms-marco-MiniLM-L-12-v2)
- Download real PDFs from 7 different sources
- Extract key claims and generate evidence cards
- Package everything into `output/<session>.tar.gz`
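Behind the multi-source search, ranked lists from different retrievers (e.g., dense E5 and BM25) must be merged into one ordering. A minimal sketch of Reciprocal Rank Fusion, the standard formula behind the "3-way RRF fusion" mentioned in the feature list; the function name and the conventional `k = 60` constant are assumptions, not CiteSage's code:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by several retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three retrievers disagree; the document favoured by two of three wins.
fused = rrf_fuse([["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]])
```

RRF needs no score normalization across retrievers, which is why it is a common choice for fusing BM25 scores with vector distances.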
```
output/
└── verify_your_paper_20260224_154000/
    ├── reports/
    │   └── report.md            ← Citation verification results
    ├── diagnostics/
    │   └── doctor_audit.json    ← System health + repair log
    └── research_data/
        └── pdfs/                ← Downloaded PDFs (curate mode)
```
Each citation gets one of:
- ✅ Verified — exists, metadata matches, content aligns
- ⚠️ Warning — exists, but title/author mismatch or uncertain context
- ❌ Failed — not found in any database, DOI is fake, or paper retracted
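How such a verdict might be assigned can be sketched as follows. The thresholds and function names here are hypothetical illustrations, not CiteSage's actual logic:

```python
from difflib import SequenceMatcher

def classify(found: bool, cited_title: str, db_title: str,
             retracted: bool) -> str:
    """Map one citation's lookup results onto verified/warning/failed."""
    if not found or retracted:
        return "failed"        # missing from every database, or retracted
    similarity = SequenceMatcher(
        None, cited_title.lower(), db_title.lower()
    ).ratio()
    if similarity >= 0.9:
        return "verified"      # metadata matches the database record
    return "warning"           # exists, but the metadata disagrees
```

In the real pipeline the content-alignment check (does the claim match the paper?) would feed into this verdict as well; this sketch covers only the metadata side.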
```
Your PDF
│
├─ Layout-aware parsing (IBM Docling)
│   └─ Isolates the References section accurately
│
├─ GROBID citation extraction
│   └─ Hallucination filter: fake DOIs discarded immediately
│
├─ Multi-source lookup (CrossRef + OpenAlex + Semantic Scholar)
│   └─ Retraction shield + predatory publisher filter
│
├─ Semantic content matching (E5 + LLM dual analysis)
│   └─ "Did the author actually use this paper correctly?"
│
└─ Auto-repair (LangChain ReAct Agent)
    ├─ Tier 1: CrossRef / OpenAlex / Semantic Scholar API lookup (authoritative)
    ├─ Tier 2: Web search → candidate generation only (not used as ground truth)
    └─ Result must resolve to a verifiable DOI or OpenAlex Work ID to be accepted
```
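The hallucination-filter stage can be approximated by a purely syntactic DOI check that rejects impossible strings before any network lookup. A minimal sketch; the regex follows CrossRef's published guidance on modern DOI shape (prefix `10.`, a 4–9 digit registrant code, a slash, then a suffix), but the names are assumptions, not CiteSage's code:

```python
import re

# Modern CrossRef DOIs look like 10.NNNN/suffix; anything else is
# discarded up front, before spending an API call on it.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$", re.IGNORECASE)

def plausible_doi(doi: str) -> bool:
    """Reject strings that cannot be DOIs, without any network lookup."""
    return bool(DOI_RE.match(doi.strip()))
```

Strings that pass this filter still need to be resolved against CrossRef or OpenAlex; the regex only removes obvious parser inventions cheaply.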
- Paragraph-level verification (`--paragraph`) — checks each in-text citation against its reference
- Blind review mode (`--role reviewer`) — anonymizes all provenance data for double-blind compliance
- Deterministic runs (`--deterministic`) — locks the random seed for reproducible results
- 5-source parallel search — OpenAlex + Semantic Scholar + arXiv + Europe PMC + CrossRef
- 7-source PDF acquisition — Unpaywall → S2 → CrossRef → WebSearch → CORE → DOAJ → BASE
- Cloudflare bypass — `cloudscraper` handles publisher bot protection
- Paywall detection — HEAD pre-check blocks HTML paywall pages before downloading
- RAG-powered retrieval — E5 (1024d) + BM25 + 3-way RRF fusion (LanceDB)
- Graph memory — Neo4j stores paper–author–institution relationships for citation network analysis
- Self-healing — API failures trigger exponential backoff (2/5/10s); Circuit Breaker pauses after 5 consecutive failures
- Session encapsulation — every run packaged into `output/<session>.tar.gz` for reproducibility
- Lightning-fast builds — `uv` package manager for sub-second dependency resolution
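The self-healing behaviour above (2/5/10s exponential backoff plus a circuit breaker that opens after five consecutive failures) can be sketched as follows. Illustrative only; class and function names are assumptions, not CiteSage's implementation:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; a success resets it."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1

def call_with_backoff(fn, breaker: CircuitBreaker, delays=(2, 5, 10)):
    """Try fn once per delay step, sleeping between attempts."""
    for attempt, delay in enumerate(delays):
        if breaker.open:
            raise RuntimeError("circuit open: upstream API paused")
        try:
            result = fn()
            breaker.record(success=True)
            return result
        except Exception:
            breaker.record(success=False)
            if attempt == len(delays) - 1:
                raise       # backoff exhausted; propagate the real error
            time.sleep(delay)
```

The breaker is shared across calls to one upstream API, so a flapping service is paused globally instead of being hammered by every citation lookup.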
Copy `.env.example` to `.env` and fill in:

```shell
OPENALEX_EMAIL=your@email.com       # Polite pool (faster rate limits)
SEMANTIC_SCHOLAR_API_KEY=your_key   # Required for high-volume searches
CORE_API_KEY=your_key               # Optional: CORE open access
```

```shell
# Full test suite (277 tests)
docker compose run --rm --entrypoint "" verifier pytest tests/ -q

# Prompt regression (7/7 passing) — run after changing Scout/Synthesizer prompts
npx promptfoo@latest eval --config eval/prompts/scout_intent.yaml
npx promptfoo@latest eval --config eval/prompts/synthesizer.yaml

# RAG system validation
docker compose run --rm --entrypoint "" verifier python eval/rag/unified_rag_test.py
```

| Phase | What was built | Status |
|---|---|---|
| 1–10 | Core verification engine, GROBID, multi-format support, Docker CI | ✅ Done |
| 11 | Hybrid RAG (BM25+Vector RRF), Section-Aware Chunking, Cross-Encoder Reranker | ✅ Done |
| 12 | Grounded generation (sentence-level attribution), Ragas CI gate, Cloudflare e2e | ✅ Done |
| 13 | 5-source search, E5 (1024d), 7-source PDF chain, API infra (KeyRotator + CircuitBreaker) | ✅ Done |
| 14 | Architecture refactoring (CLI/pipeline split, RepairAgent, EmbeddingManager), Docker hardening | ✅ Done |
| 15 | Agentic Web Portal (Chainlit), Temporal Alignment sorting, Evidence Cards visualization | ✅ Done |
| 16 | Output Path Standardization, Log consolidation, metadata_resolver stability audit | ✅ Done |
| 17 | Universal System Audit, Gold Standard normalization, Dual-Push synchronized deployment | ✅ Done |
| 17+ | LanceDB Migration, Shared Memory Integration, Memory Isolation Rules | ✅ Done |
| 18 | Production scaling, advanced analytics | 🔵 Started |
If you use CiteSage in research, this preferred citation is appreciated (not a legal requirement under MIT):
```bibtex
@software{citesage2026,
  author = {Chiang, Chenwei},
  title  = {CiteSage: An Agentic Verification Framework for Academic Integrity},
  year   = {2026},
  url    = {https://github.com/chenweichiang/citesage},
  note   = {v5.0.1, Phase 18}
}
```

MIT License. See LICENSE.
CiteSage is an academic research artifact. For notes on research priority and collaboration, see RESEARCH-NOTICE.md. For the full development history and roadmap, see SYSTEM_PLAN.md.
