
CiteSage Logo

CiteSage

Stop trusting citations blindly. Verify them.



What is CiteSage?

CiteSage is a citation integrity verification pipeline for submission and peer review workflows.

Starting from a manuscript (PDF or Word), it automatically extracts references, verifies metadata across multiple academic databases, and audits whether each author claim is actually supported by the cited source — with a full, reproducible evidence trail.

If you've ever submitted a paper with a citation that turned out not to exist, or used an AI writing tool that confidently invented a reference — CiteSage is built for that problem.

Tip

Phase 18 Gold Standard: CiteSage v5.0.1 has achieved absolute standardization through a universal audit (Phase 16-17), LanceDB migration (Phase 17+), and a strict Open Source Tool Selection Policy — featuring a unified output architecture, 100% test coverage for core agents, scientific/agentic memory isolation, and a high-performance Agentic Web Portal (Chainlit).

It does two things:

| Mode | What it does |
|------|--------------|
| Verify | You give it a paper (PDF/Word). It checks every citation: does this paper exist? Is the metadata correct? Does the cited content actually match what the author claimed? |
| Curate | You describe a research topic in plain language. CiteSage searches global academic databases, downloads real PDFs, and generates a structured evidence report. |

Why CiteSage?

The problem: LLMs hallucinate citations. Researchers copy wrong metadata. Papers cite retracted work. Traditional citation managers don't verify anything — they just store references.

What CiteSage actually does:

  • Checks if the cited DOI exists in CrossRef, OpenAlex, or Semantic Scholar
  • Detects retracted papers via OpenAlex and CrossRef signals
  • Catches GROBID hallucinations (fake DOIs invented by the PDF parser)
  • Compares the author's claim against the actual content of the cited paper
  • Flags paywall HTML disguised as PDFs before wasting your time
  • Blocks predatory publishers (MDPI etc.) with warnings
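The GROBID hallucination check in the list above can be illustrated as a syntactic pre-filter: reject strings that cannot be real DOIs before spending any API calls on them. This is a minimal sketch of the idea, not CiteSage's actual filter; the function name and regex are assumptions.

```python
import re

# DOIs always start with the "10." directory indicator, a 4-9 digit
# registrant code, a slash, and a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_real_doi(doi: str) -> bool:
    """Return True if the string is at least syntactically a valid DOI.

    Parser-invented strings usually fail this check, so they can be
    discarded before any database lookup.
    """
    return bool(DOI_PATTERN.match(doi.strip()))
```

A syntactic pass does not prove the DOI exists; it only avoids wasting API requests on strings that cannot possibly resolve. Existence is still settled by the CrossRef/OpenAlex/Semantic Scholar lookups.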

Quick Start

Prerequisites

  • Docker & Docker Compose
  • 16GB RAM (E5 embedding model runs locally)

Setup

```bash
git clone https://github.com/chenweichiang/citesage.git
cd citesage
cp .env.example .env        # Add your API keys (OpenAlex, Semantic Scholar)
docker compose up -d        # Starts GROBID + Neo4j + Ollama in background
```

Verify a paper's citations

```bash
docker compose run --rm verifier your_paper.pdf --auto-fix
```

CiteSage will:

  1. Extract all citations from your PDF
  2. Look each one up in multiple academic databases
  3. Flag any that are missing, wrong, or misrepresented
  4. Auto-repair broken citations using AI agents
  5. Generate an HTML report in output/

Find papers on a topic

```bash
docker compose run --rm verifier --curate "large language models in accessibility design" --count 8
```

CiteSage will:

  1. Search OpenAlex, Semantic Scholar, arXiv, Europe PMC, and CrossRef simultaneously
  2. Re-rank results using a Cross-Encoder (ms-marco-MiniLM-L-12-v2)
  3. Download real PDFs from 7 different sources
  4. Extract key claims and generate evidence cards
  5. Package everything into output/<session>.tar.gz
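Step 5's session packaging can be sketched with the standard library. This is an illustrative sketch, not CiteSage's actual code; `package_session` is a hypothetical helper name.

```python
import tarfile
from pathlib import Path

def package_session(session_dir: str, archive_path: str) -> str:
    """Bundle a whole session directory into a .tar.gz for reproducibility.

    The directory is stored under its own basename so the archive
    unpacks into a single session folder.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(session_dir, arcname=Path(session_dir).name)
    return archive_path
```

Packaging the full directory (reports, diagnostics, PDFs) into one archive is what makes a run shareable as a single reproducible artifact.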

What the output looks like

```text
output/
└── verify_your_paper_20260224_154000/
    ├── reports/
    │   └── report.md           ← Citation verification results
    ├── diagnostics/
    │   └── doctor_audit.json   ← System health + repair log
    └── research_data/
        └── pdfs/               ← Downloaded PDFs (curate mode)
```

Each citation gets one of:

  • ✅ Verified — exists, metadata matches, content aligns
  • ⚠️ Warning — exists but title/author mismatch, or context uncertain
  • ❌ Failed — not found in any database, DOI is fake, or paper retracted
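A minimal sketch of how these three statuses might be assigned from the verification signals. The function and its parameters are illustrative assumptions, not CiteSage's internals.

```python
def classify_citation(found: bool, metadata_match: bool,
                      content_aligned: bool, retracted: bool = False) -> str:
    """Map verification signals onto the three report statuses.

    A citation that cannot be found, or that points at retracted work,
    fails outright; anything found but imperfect is a warning.
    """
    if not found or retracted:
        return "Failed"
    if metadata_match and content_aligned:
        return "Verified"
    return "Warning"
```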

How it works

```text
Your PDF
  │
  ├─ Layout-aware parsing (IBM Docling)
  │   └─ Isolates the References section accurately
  │
  ├─ GROBID citation extraction
  │   └─ Hallucination filter: fake DOIs discarded immediately
  │
  ├─ Multi-source lookup (CrossRef + OpenAlex + Semantic Scholar)
  │   └─ Retraction shield + predatory publisher filter
  │
  ├─ Semantic content matching (E5 + LLM dual analysis)
  │   └─ "Did the author actually use this paper correctly?"
  │
  └─ Auto-repair (LangChain ReAct Agent)
      ├─ Tier 1: CrossRef / OpenAlex / Semantic Scholar API lookup (authoritative)
      ├─ Tier 2: Web search → candidate generation only (not used as ground truth)
      └─ Result must resolve to a verifiable DOI or OpenAlex Work ID to be accepted
```
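The final acceptance rule (a repair counts only if it resolves to a verifiable DOI or OpenAlex Work ID) might look like this sketch. `accept_repair` and the candidate dict shape are assumptions for illustration, not the project's real interface.

```python
import re

def accept_repair(candidate: dict) -> bool:
    """Accept a repaired citation only if it carries a verifiable identifier.

    Tier-2 web-search hits that lack a DOI or an OpenAlex Work ID
    (e.g. "W2741809807") are rejected rather than trusted as ground truth.
    """
    doi = candidate.get("doi", "")
    openalex_id = candidate.get("openalex_id", "")
    has_doi = bool(re.match(r"^10\.\d{4,9}/\S+$", doi))
    has_work_id = bool(re.match(r"^W\d+$", openalex_id))
    return has_doi or has_work_id
```

Gating repairs on a resolvable identifier is what keeps the agent from replacing one hallucinated reference with another: every accepted fix can be re-checked against an authoritative database.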

Key features

For researchers verifying their own papers

  • Paragraph-level verification (--paragraph) — checks each in-text citation against its reference
  • Blind review mode (--role reviewer) — anonymizes all provenance data for double-blind compliance
  • Deterministic runs (--deterministic) — lock random seed for reproducible results

For systematic reviews and literature discovery

  • 5-source parallel search — OpenAlex + Semantic Scholar + arXiv + Europe PMC + CrossRef
  • 7-source PDF acquisition — Unpaywall → S2 → CrossRef → WebSearch → CORE → DOAJ → BASE
  • Cloudflare bypass — cloudscraper handles publisher bot protection
  • Paywall detection — HEAD pre-check blocks HTML paywall pages before downloading
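The paywall pre-check above can be reduced to a pure header test on the HEAD response: an HTML paywall page announces itself in `Content-Type` before any bytes are downloaded. `is_probably_pdf` is a hypothetical helper sketching the heuristic, not the project's actual code.

```python
def is_probably_pdf(headers: dict) -> bool:
    """Decide from HEAD-response headers whether a URL likely serves
    a real PDF rather than an HTML paywall page.

    Missing or non-PDF Content-Type values are treated as suspect,
    so the downloader skips them instead of wasting bandwidth.
    """
    ctype = headers.get("Content-Type", "").lower()
    return "application/pdf" in ctype
```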

Under the hood

  • RAG-powered retrieval — E5 (1024d) + BM25 + 3-way RRF fusion (LanceDB)
  • Graph memory — Neo4j stores paper–author–institution relationships for citation network analysis
  • Self-healing — API failures trigger exponential backoff (2/5/10s); Circuit Breaker pauses after 5 consecutive failures
  • Session encapsulation — every run packaged into output/<session>.tar.gz for reproducibility
  • Lightning-fast builds — uv package manager for sub-second dependency resolution
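The 3-way RRF fusion mentioned above follows the standard Reciprocal Rank Fusion formula: each source's ranking contributes 1/(k + rank) to a document's score, and documents are re-sorted by their summed scores. A minimal sketch (the function name is an assumption; k = 60 is the value commonly used in the RRF literature):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked doc-ID lists with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document (rank starts at 1),
    so documents ranked highly by multiple sources float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only rank positions, never raw scores, which is why it can fuse heterogeneous signals like BM25 and dense-vector similarity without any score calibration.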

Configuration

Copy .env.example to .env and fill in:

```ini
OPENALEX_EMAIL=your@email.com          # Polite pool (faster rate limits)
SEMANTIC_SCHOLAR_API_KEY=your_key      # Required for high-volume searches
CORE_API_KEY=your_key                  # Optional: CORE open access
```

Running tests

```bash
# Full test suite (277 tests)
docker compose run --rm --entrypoint "" verifier pytest tests/ -q

# Prompt regression (7/7 passing — after changing Scout/Synthesizer prompts)
npx promptfoo@latest eval --config eval/prompts/scout_intent.yaml
npx promptfoo@latest eval --config eval/prompts/synthesizer.yaml

# RAG system validation
docker compose run --rm --entrypoint "" verifier python eval/rag/unified_rag_test.py
```

Roadmap

| Phase | What was built | Status |
|-------|----------------|--------|
| 1–10 | Core verification engine, GROBID, multi-format support, Docker CI | ✅ Done |
| 11 | Hybrid RAG (BM25+Vector RRF), Section-Aware Chunking, Cross-Encoder Reranker | ✅ Done |
| 12 | Grounded generation (sentence-level attribution), Ragas CI gate, Cloudflare e2e | ✅ Done |
| 13 | 5-source search, E5 (1024d), 7-source PDF chain, API infra (KeyRotator + CircuitBreaker) | ✅ Done |
| 14 | Architecture refactoring (CLI/pipeline split, RepairAgent, EmbeddingManager), Docker hardening | ✅ Done |
| 15 | Agentic Web Portal (Chainlit), Temporal Alignment sorting, Evidence Cards visualization | ✅ Done |
| 16 | Output Path Standardization, log consolidation, metadata_resolver stability audit | ✅ Done |
| 17 | Universal System Audit, Gold Standard normalization, Dual-Push synchronized deployment | ✅ Done |
| 17+ | LanceDB Migration, Shared Memory Integration, Memory Isolation Rules | ✅ Done |
| 18 | Production scaling, advanced analytics | 🔵 Started |

Academic Attribution

If you use CiteSage in research, this preferred citation is appreciated (not a legal requirement under MIT):

```bibtex
@software{citesage2026,
  author = {Chiang, Chenwei},
  title  = {CiteSage: An Agentic Verification Framework for Academic Integrity},
  year   = {2026},
  url    = {https://github.com/chenweichiang/citesage},
  note   = {v5.0.1, Phase 18}
}
```

License

MIT License. See LICENSE.

CiteSage is an academic research artifact. For notes on research priority and collaboration, see RESEARCH-NOTICE.md. For the full development history and roadmap, see SYSTEM_PLAN.md.


About

Automated citation verification system with Agentic self-learning. Validates academic metadata and semantic consistency using ML and multi-source databases.
