IFF (Instruction Following for Finance) is a specialized evaluation framework designed to assess language models' ability to follow complex instructions in finance-specific contexts. The framework provides over 40 specialized instruction checkers covering various financial domains including equities, credit, FX, compliance, risk management, derivatives, and more.
- Features
- Installation
- Quick Start
- Architecture
- Instruction Categories
- Usage Guide
- API Reference
- Contributing
- License
- 40+ Finance-Specific Instructions: Comprehensive coverage of financial domains
- Dual Evaluation Modes: Strict and loose evaluation strategies
- Multi-Model Support: Tested with various LLMs (Kimi, Llama 3.1 8B, etc.)
- Flexible Architecture: Modular design for easy extension
- Detailed Reporting: Prompt-level and instruction-level accuracy metrics
- Batch Processing: Efficient evaluation of large response sets
- Type Safety: Full type hints with ty type checker integration
- Equities & Trading: Market analysis, trading strategies, portfolio management
- Credit & Fixed Income: Spread analysis, carry calculations, bond metrics
- Foreign Exchange: FX calculations, cross-currency analysis
- Compliance & Regulatory: Rule 10b-5, AML, regulatory reporting
- Risk Management: VaR calculations, risk metrics, stress testing
- Derivatives: Options pricing (Black-76), Greeks, structured products
- Treasury Operations: Liquidity management, settlement processes
- ESG & Climate Finance: Sustainability metrics, carbon accounting
- Private Equity & VC: Deal structuring, valuation metrics
- Quantitative Finance: Algorithmic strategies, pseudocode generation
- Python 3.12 or higher
- Virtual environment (recommended)
# Clone the repository
git clone <repository-url>
cd iff
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -e .

# Install UV if not already installed
pip install uv
# Install dependencies with UV
uv pip install .

# Install with development dependencies
pip install -e .

Create a .env file in the project root and add your model provider keys.
For TogetherAI, set the TOGETHERAI_API_KEY variable:
cp .env.example .env
echo "TOGETHERAI_API_KEY=your_api_key" >> .envpython build_input_jsonl.py
# This creates examples/inputs.jsonl with test prompts

# Using litellm for multi-provider support (OpenAI, Anthropic, Together, etc.)
# --provider accepts: together, openai, anthropic, azure, cohere, huggingface
python generate_responses.py \
--input examples/instructions.jsonl \
--provider together \
--model "together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput"
# For reasoning models, add the --no-cot flag to return only the final output (currently available with the Anthropic and TogetherAI providers)
# Note: Untested with Anthropic models
python generate_responses.py \
--input examples/instructions.jsonl \
--provider together \
--model "deepseek-ai/DeepSeek-R1" \
--no-cot
python evaluation_bin.py \
--provider together \
--model deepseek-v3.1
# or
python evaluation_bin.py --run_dir results/together/deepseek-v3.1/runs/2025-08-26_19-59-18
# Results are saved in:
results/<provider>/<model>/runs/
# e.g. results/together/gpt-oss-120b/runs/2025-08-26_18-23-14
# results/together/gpt-oss-120b/runs/latest
python analyze_eval.py \
--strict results/together/gpt-oss-120b/runs/latest/evaluations/strict.jsonl \
--loose results/together/gpt-oss-120b/runs/latest/evaluations/loose.jsonl \
--out eval_reports/gpt-oss-120b_latest
iff/
├── Core Modules
│ ├── evaluation_lib.py # Core evaluation engine
│ ├── evaluation_bin.py # CLI evaluation runner
│ ├── instructions_registry.py # Instruction registration system
│ └── instructions_util.py # Utility functions for text processing
│
├── Instruction Modules
│ ├── instructions_iff.py # General instruction checkers
│ └── finance_instructions.py # Finance-specific instruction checkers
│
├── Data Generation
│ ├── build_input_jsonl.py # Generate test inputs
│ └── generate_responses.py # Multi-provider response generation
│
├── Configuration
│ ├── pyproject.toml # Project metadata and dependencies
│ └── requirements.txt # Python dependencies
│
└── Data Directories
├── results # Responses and evaluation results
└── eval_reports # Analysis of output responses
graph LR
A[Input Prompts] --> B[LLM Response Generation]
B --> C[Response Collection]
C --> D[Evaluation Engine]
D --> E[Instruction Checkers]
E --> F[Results & Metrics]
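The stages above map directly onto the repository's scripts. As a rough sketch (using only the CLI flags documented in Quick Start; the provider and model identifiers are illustrative), the whole pipeline can be driven end to end like this:

```python
# Minimal pipeline driver; provider/model values are examples only.
import subprocess

def run(cmd: list[str]) -> None:
    """Run one pipeline stage, failing fast on errors."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Input prompts
run(["python", "build_input_jsonl.py"])

# LLM response generation + response collection
run([
    "python", "generate_responses.py",
    "--input", "examples/instructions.jsonl",
    "--provider", "together",
    "--model", "together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput",
])

# Evaluation engine + instruction checkers -> results & metrics
# (the --model value is assumed to select a run under results/<provider>/)
run(["python", "evaluation_bin.py", "--provider", "together", "--model", "deepseek-v3.1"])
```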
- fin:equities:bold_intro_italic_risk - Equity analysis with formatting
- fin:credit:table_spread_vs_carry - Credit spread and carry analysis
- fin:fx:calc_codeblock_limit - FX calculations with code blocks
- fin:compliance:rule10b5_numbered - Compliance rule formatting
- fin:ops:settlement_checklist - Settlement process checklists
- fin:ir:six_bullets_verb_buyback - Investor relations bullet points
- fin:treasury:liquidity_risk_section - Treasury risk sections
- fin:deriv:black76_latex_sigma - Derivatives pricing formulas
- fin:risk:var_numbered_boldusd - VaR risk metrics
- fin:pe:subheaders_dashes - Private equity formatting
- Quantitative analysis and pseudocode generation
- Cryptocurrency and digital asset reporting
- Asset-backed securities analysis
- REIT analysis with word limits
- Structured products documentation
- Central bank communications
- Credit ratings analysis
- Pension fund reporting
- Margin and collateral management
- ETF analysis with timestamps
- And many more...
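For a concrete picture of how these IDs are consumed, a single record in examples/inputs.jsonl might look like the hypothetical sketch below. The field names follow the InputExample dataclass documented under API Reference; the key, prompt, and kwargs values are purely illustrative.

```python
# Hypothetical inputs.jsonl record; values are illustrative, not from the dataset.
import json

record = {
    "key": 1001,
    "instruction_id_list": ["fin:risk:var_numbered_boldusd"],
    "prompt": "Summarize the desk's one-day 99% VaR drivers as a numbered list, "
              "bolding every USD figure.",
    "kwargs": [{}],  # one dict of parameters per instruction ID
}
print(json.dumps(record))  # one JSON object per line in the JSONL file
```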
from instructions_util import InstructionChecker

class CustomFinanceInstruction(InstructionChecker):
    def build_description(self, **kwargs):
        self.inst_description = "Your instruction description"

    def check_following(self, text):
        # Implement your checking logic
        return meets_criteria(text)

# In instructions_registry.py
CANONICAL["fin:custom:instruction"] = CustomFinanceInstruction

import evaluation_lib as eval_lib
# Load test data
inputs = eval_lib.read_prompt_list("examples/inputs.jsonl")
responses = eval_lib.read_prompt_to_response_dict("examples/responses.jsonl")
# Run evaluation
outputs = []
for inp in inputs:
    result = eval_lib.test_instruction_following_strict(inp, responses)
    outputs.append(result)
# Generate report
eval_lib.print_report(outputs)

Strict mode:
- Exact matching of instruction requirements
- No tolerance for formatting deviations
- Single response attempt
Loose mode:
- Multiple response variants tested
- Tolerates minor formatting issues
- Removes markdown artifacts
- Tests with trimmed versions
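To make the difference concrete, the sketch below shows the general idea of a loose pass: the same checker is retried on a few cleaned-up variants of the response (markdown stripped, first or last line trimmed). This is only an illustration; the actual variant set is implemented in evaluation_lib.test_instruction_following_loose.

```python
# Illustration of loose evaluation: retry one checker over cleaned-up
# variants of the response instead of only the raw text.
def loose_pass(checker, response: str) -> bool:
    lines = response.splitlines()
    variants = [
        response,                   # the raw response (what strict mode sees)
        response.replace("*", ""),  # markdown emphasis removed
        "\n".join(lines[1:]),       # first line trimmed
        "\n".join(lines[:-1]),      # last line trimmed
    ]
    return any(checker.check_following(v) for v in variants)
```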
@dataclasses.dataclass
class InputExample:
    key: int                        # Unique identifier
    instruction_id_list: list       # List of instruction IDs to check
    prompt: str                     # The prompt text
    kwargs: list                    # Parameters for each instruction

@dataclasses.dataclass
class OutputExample:
    instruction_id_list: list       # Instructions checked
    prompt: str                     # Original prompt
    response: str                   # Model response
    follow_all_instructions: bool   # Overall success
    follow_instruction_list: list   # Per-instruction results

Core functions in evaluation_lib.py:
- read_prompt_list(path) - Load test prompts
- read_prompt_to_response_dict(path) - Load responses
- test_instruction_following_strict() - Strict evaluation
- test_instruction_following_loose() - Loose evaluation
- print_report(outputs) - Generate evaluation report

Utility functions in instructions_util.py:
- count_words(text) - Word counting
- split_into_sentences(text) - Sentence tokenization
- numbered_lines(text) - Extract numbered lists
- bullet_lines(text) - Extract bullet points
- Text formatting validators
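As an example of how these helpers can be combined inside a checker, the hypothetical class below enforces "exactly six bullets, at most 20 words each". The class name and thresholds are illustrative, and only the utility signatures listed above are assumed.

```python
# Hypothetical checker built on the utilities above; not part of the framework.
import instructions_util


class SixShortBulletsChecker:
    """Passes if the response has exactly six bullets of at most 20 words each."""

    def check_following(self, text: str) -> bool:
        bullets = instructions_util.bullet_lines(text)
        if len(bullets) != 6:
            return False
        return all(instructions_util.count_words(b) <= 20 for b in bullets)
```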
- Prompt-Level Accuracy: Percentage of prompts where all instructions are followed
- Instruction-Level Accuracy: Percentage of individual instructions followed correctly
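As a quick illustration of the arithmetic, both metrics can be computed from a list of results shaped like OutputExample above; the helper below assumes only the two fields shown and is not part of evaluation_lib.

```python
# Illustrative metric computation over OutputExample-like results.
def accuracy_report(outputs) -> tuple[float, float]:
    prompt_hits = sum(1 for o in outputs if o.follow_all_instructions)
    flags = [flag for o in outputs for flag in o.follow_instruction_list]
    prompt_level = prompt_hits / len(outputs)     # prompts with every instruction followed
    instruction_level = sum(flags) / len(flags)   # individual instructions followed
    return prompt_level, instruction_level

# e.g. (0.752, 0.891) reads as: 75.2% of prompts had every instruction
# followed, and 89.1% of individual instructions were satisfied.
```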
Model Performance (Example):
- Llama 3.1 8B:
- Strict: prompt-level: 0.752, instruction-level: 0.891
- Loose: prompt-level: 0.834, instruction-level: 0.923
- Kimi:
- Strict: prompt-level: 0.698, instruction-level: 0.867
- Loose: prompt-level: 0.792, instruction-level: 0.902
- Create instruction class in finance_instructions.py
- Register in instructions_registry.py
- Add test cases in build_input_jsonl.py
- Run evaluation pipeline
# Run unit tests
uv run pytest tests/
# Run specific test
uv run pytest tests/test_evaluation.py::test_strict_mode
# Run with coverage
uv run pytest tests/ --cov=. --cov-report=term-missing

# Type checking with ty (ultrafast Rust-based type checker)
uv run ty check .
# Format code with ruff
uv run ruff format .
# Lint with ruff
uv run ruff check .
# Run all checks
make all # Runs lint, format, typecheck, and test

This project uses ty, Astral's ultrafast Python type checker written in Rust. All code includes comprehensive type hints.
# Install ty (included in dependencies)
uv add ty
# Run type checking
uv run ty check .
# Type check specific files
uv run ty check instructions_util.py finance_instructions.py

Configuration is in pyproject.toml under [tool.ty].
- NLTK Data Missing

  # The framework auto-downloads required NLTK data
  # Manual download if needed:
  import nltk
  nltk.download('punkt')
  nltk.download('punkt_tab')
  nltk.download('averaged_perceptron_tagger')
- Memory Issues with Large Datasets
  - Process in batches
  - Use the --batch_size parameter
  - Increase system memory allocation
- API Rate Limits
  - Implement exponential backoff (see the sketch after this list)
  - Use the --delay parameter between requests
  - Consider local model deployment
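For the backoff suggestion above, a generic retry wrapper like the sketch below (not part of this repository) is usually enough; in practice, narrow the exception type to your provider's rate-limit error.

```python
# Generic exponential-backoff retry helper; wrap each API call with it.
import random
import time


def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # replace with the provider's rate-limit exception
            if attempt == max_retries - 1:
                raise
            # Exponential delay with jitter: ~1s, 2s, 4s, ... plus noise
            time.sleep(base_delay * 2 ** attempt + random.random())
```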
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit pull request
MIT License - See LICENSE file for details
If you use IFF in your research, please cite:
@software{iff2024,
title = {IFF: Instruction Following for Finance},
year = {2024},
url = {https://github.com/gtfintechlab/IFF}
}

For questions and support, please open an issue on GitHub.
IFF is built upon and adapted from several open-source projects:
- IFEval (Instruction Following Evaluation): The core evaluation framework is adapted from IFEval, originally developed by:
- Google Research (evaluation_lib.py) - Licensed under Apache License 2.0
- Allen Institute for AI (instructions_registry.py, instructions_util.py) - Licensed under Apache License 2.0
We have adapted these components for finance-specific instruction evaluation while maintaining the original licensing terms.
- LiteLLM: Multi-provider LLM integration - MIT License
- NLTK: Natural language processing toolkit - Apache License 2.0
- Absl-py: Abseil Python Common Libraries - Apache License 2.0
- litellm-gateway: Custom gateway module for LLM integration
- manuscript: Research paper and documentation
All original copyrights and licenses have been preserved in the respective source files. This project is distributed under the MIT License for new contributions while respecting the licensing terms of incorporated components.