A structural audit tool for Excel-based financial models. It reads an Excel workbook, maps every cell and formula into a dependency graph, runs deterministic checks for common modeling errors (hard-coded plugs, broken references, balance sheet imbalances), and produces a PDF memo and Excel report of findings. An optional LLM layer can generate narrative summaries — but the core audit is purely rule-based and does not require any API keys.
This is a continually developed project. Features, interfaces, and test coverage expand over time as new research ideas and workflow needs arise.
Key questions this project answers:
- Does this Excel model have structural errors that could affect the output?
- Where are the hard-coded overrides hiding in this workbook?
./run.sh # setup + launch Streamlit appOr manually:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Streamlit dashboard
streamlit run app.py
# Upload sample_models/BOBWEIR_Model.xlsx to test
# CLI
python main.py # runs against sample model
python main.py path/to/your_model.xlsx # runs against any modelRun ./run.sh help for all commands (setup, app, cli, test).
pip install anthropic python-dotenv # or: pip install openai python-dotenv
cp .env.example .env
# Edit .env and add your API keyThe audit runs in four phases:
╔═══════════════════════════════════════════════════════════╗
║ EXCEL FILE ║
║ .xlsx workbook upload ║
╚═══════════════════════════════╤═══════════════════════════╝
▼
┌─ PHASE 1 ── Ingestion ───────────────────────────────────┐
│ Dual-state loading (values + formulas) │
└───────────────────────────────┬───────────────────────────┘
▼
┌─ PHASE 2 ── Dependency Graph ────────────────────────────┐
│ networkx DAG of cell references and cross-sheet flows │
└───────────────────────────────┬───────────────────────────┘
▼
┌─ PHASE 3 ── Audit Checks ───────────────────────────────┐
│ Deterministic heuristics (plugs, BS integrity, refs) │
└───────────────────────────────┬───────────────────────────┘
▼
┌─ PHASE 4 ── Reporting ──────────────────────────────────┐
│ Complexity score, issue catalog, remediation │
│ (PDF memo + Excel datatape) │
└──────────────────────────────────────────────────────────┘
-
Ingestion — Loads the workbook twice: once for calculated values, once for raw formulas. This enables both numerical checks and logical tracing.
-
Dependency graph — Parses every formula into a directed graph using networkx. Each cell is a node; each reference is an edge. This reveals circular references, orphaned cells, and cross-sheet data flows.
-
Audit checks — Three core checks run against the graph and values:
- Hard-coded plugs — Rows where most cells are formulas but some projection-period cells are constants (analysts overriding the model).
- Balance sheet integrity — Checks that Assets = Liabilities + Equity across all projection periods (tolerance: ±$1).
- Broken/external references — Detects
#REF!,#NAME?,#DIV/0!, and links to external files.
-
Reporting — Produces a complexity score (1–5 based on sheet count, formula density, and interdependency), a PDF memo, and an Excel datatape with all findings.
┌─ CONTROL ── Audit Engine ────────────────────────────────┐
│ Deterministic checks (heuristic-based) │
└───────────────────────────────┬───────────────────────────┘
▼
┌─ REASONING ── LLM Analyzer ─────────────────────────────┐
│ Narrative summary of findings (scoped, no advice) │
└───────────────────────────────┬───────────────────────────┘
▼
┌─ DECISION ── Human Review ──────────────────────────────┐
│ Final review of audit results + LLM narrative │
└──────────────────────────────────────────────────────────┘
The LLM receives audit findings and produces narrative summaries. It is constrained at the prompt level:
- Allowed: explain findings, prioritize by materiality, suggest remediation, reference specific cells.
- Disallowed: investment recommendations, valuation opinions, price targets, data invention.
The repo includes a synthetic test case: BOBWEIR Pharmaceuticals (sample_models/BOBWEIR_Model.xlsx).
| Sheet | Contents |
|---|---|
| Cover | Company overview, product list |
| Assumptions | Growth, margins, working capital drivers |
| Revenue | Revenue build for six products |
| IS | Income statement with linked formulas |
| BS | Balance sheet with working capital calculations |
| CF | Cash flow statement |
| DCF | DCF schedule and terminal value |
Intentional issues for testing: a hard-coded revenue plug (Revenue sheet, Neurex 2025E) and full formula linkages for dependency tracing. Regenerate with python scripts/create_sample_model.py.
The builder/ module is an independent financial modeling library included in this repo for convenience. It does not interact with the audit engine.
It provides programmatic construction of:
- DCF models — scenario-weighted discounted cash flow with sensitivity tables
- Comparable company analysis — peer selection and implied valuation
- Operating models — revenue/cost projections with flag detection
Each builder uses Pydantic for input validation and produces structured outputs with markdown reports. See examples/saas_dcf_walkthrough.py for usage.
Rubrics (eval/llm_rubrics/) — YAML-based rubrics for grading LLM audit narratives: safety_and_scope.yaml (scope adherence, hallucination risk), strategy_quality.yaml (plausibility and proportionality of recommendations), reasoning_fidelity.yaml (evidence grounding, logical consistency, uncertainty calibration).
Failure modes (docs/failure_modes.md) — Catalog of 10 LLM failure patterns with detection strategies: narrative overfitting, regime anchoring, false confidence amplification, explanation-action divergence, scope creep, hallucinated causation, severity inflation, and others.
Human review (human_review/) — Reviewer guidelines and sample reviews (good / borderline / failed outputs) for calibrating human evaluators.
Trainer tasks (trainer_tasks/) — Exercises for AI trainer / RLHF evaluation: grade outputs on rubrics, identify failure modes, propose prompt or policy fixes.
The architecture (deterministic core → scoped LLM layer → human review → rubrics) maps to other domains. See docs/transferability.md for patterns in compliance, clinical decision support, fraud detection, and cybersecurity.
| Principle | Implementation |
|---|---|
| Separate reasoning from control | LLM produces text; code performs audits and report generation |
| Constrained guidance | Prompts enforce scope limits and evidence referencing |
| Evaluation of non-numeric outputs | YAML rubrics and human review guidelines |
| Failure-mode awareness | Documented patterns and targeted test cases |
| Preference for interpretability | Graph-based checks and explicit evidence paths |
excel-model-eval/
├── src/ # Core audit engine
│ ├── ingestion.py # Dual-state loading
│ ├── dependency.py # Graph construction and analysis
│ ├── auditor.py # Heuristic checks and issue catalog
│ ├── reporting.py # PDF/Excel report generation and scoring
│ └── llm_analyzer.py # Optional LLM integration
├── builder/ # Financial model builder (independent)
│ ├── dcf_builder.py # DCF valuation engine
│ ├── comps_builder.py # Comparable company analysis
│ ├── operating_model.py # Revenue/cost projections
│ ├── assumptions.py # Pydantic input schema
│ ├── validators.py # Business logic checks
│ └── outputs.py # Result data structures
├── eval/ # LLM evaluation framework
│ └── llm_rubrics/ # YAML rubrics for grading LLM outputs
├── human_review/ # Human-in-the-loop materials
├── trainer_tasks/ # Evaluation exercises
├── docs/
│ ├── failure_modes.md # LLM failure pattern catalog
│ └── transferability.md # Cross-domain architecture mapping
├── sample_models/
│ └── BOBWEIR_Model.xlsx
├── scripts/
│ └── create_sample_model.py
├── app.py # Streamlit frontend
├── main.py # CLI entry point
├── .env.example # Template for API keys
└── requirements.txt
pytest tests/ -vUnder active development. Contributions welcome — areas for improvement include additional audit heuristics, workbook format support, LLM evaluation rubrics, and model builder capabilities.
MIT
Curiosity compounds. Rigor endures.