IFF (Instruction Following for Finance) is a specialized evaluation framework designed to assess language models' ability to follow complex instructions in finance-specific contexts. The framework provides over 40 specialized instruction checkers covering various financial domains including equities, credit, FX, compliance, risk management, derivatives, and more.
- Features
- Installation
- Quick Start
- Architecture
- Instruction Categories
- Usage Guide
- API Reference
- Contributing
- License
- 40+ Finance-Specific Instructions: Comprehensive coverage of financial domains
- Dual Evaluation Modes: Strict and loose evaluation strategies
- Multi-Model Support: Tested with various LLMs (Kimi, Llama 3.1 8B, etc.)
- Flexible Architecture: Modular design for easy extension
- Detailed Reporting: Prompt-level and instruction-level accuracy metrics
- Batch Processing: Efficient evaluation of large response sets
- Type Safety: Full type hints with ty type checker integration
- Equities & Trading: Market analysis, trading strategies, portfolio management
- Credit & Fixed Income: Spread analysis, carry calculations, bond metrics
- Foreign Exchange: FX calculations, cross-currency analysis
- Compliance & Regulatory: Rule 10b-5, AML, regulatory reporting
- Risk Management: VaR calculations, risk metrics, stress testing
- Derivatives: Options pricing (Black-76), Greeks, structured products
- Treasury Operations: Liquidity management, settlement processes
- ESG & Climate Finance: Sustainability metrics, carbon accounting
- Private Equity & VC: Deal structuring, valuation metrics
- Quantitative Finance: Algorithmic strategies, pseudocode generation
- Python 3.12 or higher
- Virtual environment (recommended)
# Clone the repository
git clone <repository-url>
cd iff
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -e .

# Install UV if not already installed
pip install uv
# Install dependencies with UV
uv pip install .

# Install with development dependencies
pip install -e .

Create a .env file in the project root and add your model provider keys.
For TogetherAI, set the TOGETHERAI_API_KEY variable:
cp .env.example .env
echo "TOGETHERAI_API_KEY=your_api_key" >> .envpython build_input_jsonl.py
# This creates examples/inputs.jsonl with test prompts

# Using litellm for multi-provider support (OpenAI, Anthropic, Together, etc.)
# --provider accepts: together, openai, anthropic, azure, cohere, huggingface
python generate_responses.py \
--input examples/instructions.jsonl \
--provider together \
--model "together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput"
# For reasoning models, add the --no-cot flag to return only the final output (currently available with the Anthropic and TogetherAI providers)
# Note: Untested with Anthropic models
python generate_responses.py \
--input examples/instructions.jsonl \
--provider together \
--model "deepseek-ai/DeepSeek-R1" \
--no-cot
python evaluation_bin.py \
--provider together \
--model deepseek-v3.1
# or
python evaluation_bin.py --run_dir results/together/deepseek-v3.1/runs/2025-08-26_19-59-18
# Results are saved in:
results/<provider>/<model>/runs/
# e.g. results/together/gpt-oss-120b/runs/2025-08-26_18-23-14
# results/together/gpt-oss-120b/runs/latest
python analyze_eval.py \
--strict results/together/gpt-oss-120b/runs/latest/evaluations/strict.jsonl \
--loose results/together/gpt-oss-120b/runs/latest/evaluations/loose.jsonl \
--out eval_reports/gpt-oss-120b_latest
iff/
├── Core Modules
│ ├── evaluation_lib.py # Core evaluation engine
│ ├── evaluation_bin.py # CLI evaluation runner
│ ├── instructions_registry.py # Instruction registration system
│ └── instructions_util.py # Utility functions for text processing
│
├── Instruction Modules
│ ├── instructions_iff.py # General instruction checkers
│ └── finance_instructions.py # Finance-specific instruction checkers
│
├── Data Generation
│ ├── build_input_jsonl.py # Generate test inputs
│ └── generate_responses.py # Multi-provider response generation
│
├── Configuration
│ ├── pyproject.toml # Project metadata and dependencies
│ └── requirements.txt # Python dependencies
│
└── Data Directories
├── results # Responses and evaluation results
└── eval_reports # Analysis of output responses
graph LR
A[Input Prompts] --> B[LLM Response Generation]
B --> C[Response Collection]
C --> D[Evaluation Engine]
D --> E[Instruction Checkers]
E --> F[Results & Metrics]
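The stages above map directly onto the repository's scripts. As a rough sketch (using only the CLI flags documented in Quick Start; the provider and model identifiers are illustrative), the whole pipeline can be driven end to end like this:

```python
# Minimal pipeline driver; provider/model values are examples only.
import subprocess

def run(cmd: list[str]) -> None:
    """Run one pipeline stage, failing fast on errors."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Input prompts
run(["python", "build_input_jsonl.py"])

# LLM response generation + response collection
run([
    "python", "generate_responses.py",
    "--input", "examples/instructions.jsonl",
    "--provider", "together",
    "--model", "together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput",
])

# Evaluation engine + instruction checkers -> results & metrics
# (the --model value is assumed to select a run under results/<provider>/)
run(["python", "evaluation_bin.py", "--provider", "together", "--model", "deepseek-v3.1"])
```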
- fin:equities:bold_intro_italic_risk - Equity analysis with formatting
- fin:credit:table_spread_vs_carry - Credit spread and carry analysis
- fin:fx:calc_codeblock_limit - FX calculations with code blocks
- fin:compliance:rule10b5_numbered - Compliance rule formatting
- fin:ops:settlement_checklist - Settlement process checklists
- fin:ir:six_bullets_verb_buyback - Investor relations bullet points
- fin:treasury:liquidity_risk_section - Treasury risk sections
- fin:deriv:black76_latex_sigma - Derivatives pricing formulas
- fin:risk:var_numbered_boldusd - VaR risk metrics
- fin:pe:subheaders_dashes - Private equity formatting
- Quantitative analysis and pseudocode generation
- Cryptocurrency and digital asset reporting
- Asset-backed securities analysis
- REIT analysis with word limits
- Structured products documentation
- Central bank communications
- Credit ratings analysis
- Pension fund reporting
- Margin and collateral management
- ETF analysis with timestamps
- And many more...
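For a concrete picture of how these IDs are consumed, a single record in examples/inputs.jsonl might look like the hypothetical sketch below. The field names follow the InputExample dataclass documented under API Reference; the key, prompt, and kwargs values are purely illustrative.

```python
# Hypothetical inputs.jsonl record; values are illustrative, not from the dataset.
import json

record = {
    "key": 1001,
    "instruction_id_list": ["fin:risk:var_numbered_boldusd"],
    "prompt": "Summarize the desk's one-day 99% VaR drivers as a numbered list, "
              "bolding every USD figure.",
    "kwargs": [{}],  # one dict of parameters per instruction ID
}
print(json.dumps(record))  # one JSON object per line in the JSONL file
```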
from instructions_util import InstructionChecker

class CustomFinanceInstruction(InstructionChecker):
    def build_description(self, **kwargs):
        self.inst_description = "Your instruction description"

    def check_following(self, text):
        # Implement your checking logic
        return meets_criteria(text)

# In instructions_registry.py
CANONICAL["fin:custom:instruction"] = CustomFinanceInstruction

import evaluation_lib as eval_lib
# Load test data
inputs = eval_lib.read_prompt_list("examples/inputs.jsonl")
responses = eval_lib.read_prompt_to_response_dict("examples/responses.jsonl")
# Run evaluation
outputs = []
for inp in inputs:
    result = eval_lib.test_instruction_following_strict(inp, responses)
    outputs.append(result)
# Generate report
eval_lib.print_report(outputs)

Strict mode:
- Exact matching of instruction requirements
- No tolerance for formatting deviations
- Single response attempt
Loose mode:
- Multiple response variants tested
- Tolerates minor formatting issues
- Removes markdown artifacts
- Tests with trimmed versions
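To make the difference concrete, the sketch below shows the general idea of a loose pass: the same checker is retried on a few cleaned-up variants of the response (markdown stripped, first or last line trimmed). This is only an illustration; the actual variant set is implemented in evaluation_lib.test_instruction_following_loose.

```python
# Illustration of loose evaluation: retry one checker over cleaned-up
# variants of the response instead of only the raw text.
def loose_pass(checker, response: str) -> bool:
    lines = response.splitlines()
    variants = [
        response,                   # the raw response (what strict mode sees)
        response.replace("*", ""),  # markdown emphasis removed
        "\n".join(lines[1:]),       # first line trimmed
        "\n".join(lines[:-1]),      # last line trimmed
    ]
    return any(checker.check_following(v) for v in variants)
```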
@dataclasses.dataclass
class InputExample:
    key: int                        # Unique identifier
    instruction_id_list: list       # List of instruction IDs to check
    prompt: str                     # The prompt text
    kwargs: list                    # Parameters for each instruction

@dataclasses.dataclass
class OutputExample:
    instruction_id_list: list       # Instructions checked
    prompt: str                     # Original prompt
    response: str                   # Model response
    follow_all_instructions: bool   # Overall success
    follow_instruction_list: list   # Per-instruction results

Core functions in evaluation_lib.py:
- read_prompt_list(path) - Load test prompts
- read_prompt_to_response_dict(path) - Load responses
- test_instruction_following_strict() - Strict evaluation
- test_instruction_following_loose() - Loose evaluation
- print_report(outputs) - Generate evaluation report

Utility functions in instructions_util.py:
- count_words(text) - Word counting
- split_into_sentences(text) - Sentence tokenization
- numbered_lines(text) - Extract numbered lists
- bullet_lines(text) - Extract bullet points
- Text formatting validators
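As an example of how these helpers can be combined inside a checker, the hypothetical class below enforces "exactly six bullets, at most 20 words each". The class name and thresholds are illustrative, and only the utility signatures listed above are assumed.

```python
# Hypothetical checker built on the utilities above; not part of the framework.
import instructions_util


class SixShortBulletsChecker:
    """Passes if the response has exactly six bullets of at most 20 words each."""

    def check_following(self, text: str) -> bool:
        bullets = instructions_util.bullet_lines(text)
        if len(bullets) != 6:
            return False
        return all(instructions_util.count_words(b) <= 20 for b in bullets)
```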
- Prompt-Level Accuracy: Percentage of prompts where all instructions are followed
- Instruction-Level Accuracy: Percentage of individual instructions followed correctly
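As a quick illustration of the arithmetic, both metrics can be computed from a list of results shaped like OutputExample above; the helper below assumes only the two fields shown and is not part of evaluation_lib.

```python
# Illustrative metric computation over OutputExample-like results.
def accuracy_report(outputs) -> tuple[float, float]:
    prompt_hits = sum(1 for o in outputs if o.follow_all_instructions)
    flags = [flag for o in outputs for flag in o.follow_instruction_list]
    prompt_level = prompt_hits / len(outputs)     # prompts with every instruction followed
    instruction_level = sum(flags) / len(flags)   # individual instructions followed
    return prompt_level, instruction_level

# e.g. (0.752, 0.891) reads as: 75.2% of prompts had every instruction
# followed, and 89.1% of individual instructions were satisfied.
```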
Model Performance (Example):
- Llama 3.1 8B:
- Strict: prompt-level: 0.752, instruction-level: 0.891
- Loose: prompt-level: 0.834, instruction-level: 0.923
- Kimi:
- Strict: prompt-level: 0.698, instruction-level: 0.867
- Loose: prompt-level: 0.792, instruction-level: 0.902
- Create instruction class in finance_instructions.py
- Register in instructions_registry.py
- Add test cases in build_input_jsonl.py
- Run evaluation pipeline
# Run unit tests
uv run pytest tests/
# Run specific test
uv run pytest tests/test_evaluation.py::test_strict_mode
# Run with coverage
uv run pytest tests/ --cov=. --cov-report=term-missing

# Type checking with ty (ultrafast Rust-based type checker)
uv run ty check .
# Format code with ruff
uv run ruff format .
# Lint with ruff
uv run ruff check .
# Run all checks
make all # Runs lint, format, typecheck, and test

This project uses ty, Astral's ultrafast Python type checker written in Rust. All code includes comprehensive type hints.
# Install ty (included in dependencies)
uv add ty
# Run type checking
uv run ty check .
# Type check specific files
uv run ty check instructions_util.py finance_instructions.py

Configuration is in pyproject.toml under [tool.ty].
- NLTK Data Missing

  # The framework auto-downloads required NLTK data
  # Manual download if needed:
  import nltk
  nltk.download('punkt')
  nltk.download('punkt_tab')
  nltk.download('averaged_perceptron_tagger')
- Memory Issues with Large Datasets
  - Process in batches
  - Use the --batch_size parameter
  - Increase system memory allocation
- API Rate Limits
  - Implement exponential backoff (see the sketch after this list)
  - Use the --delay parameter between requests
  - Consider local model deployment
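For the backoff suggestion above, a generic retry wrapper like the sketch below (not part of this repository) is usually enough; in practice, narrow the exception type to your provider's rate-limit error.

```python
# Generic exponential-backoff retry helper; wrap each API call with it.
import random
import time


def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # replace with the provider's rate-limit exception
            if attempt == max_retries - 1:
                raise
            # Exponential delay with jitter: ~1s, 2s, 4s, ... plus noise
            time.sleep(base_delay * 2 ** attempt + random.random())
```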
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit pull request
MIT License - See LICENSE file for details
If you use IFF in your research, please cite:
@software{iff2024,
title = {IFF: Instruction Following for Finance},
year = {2024},
url = {https://github.com/gtfintechlab/IFF}
}

For questions and support, please open an issue on GitHub.
IFF is built upon and adapted from several open-source projects:
- IFEval (Instruction Following Evaluation): The core evaluation framework is adapted from IFEval, originally developed by:
- Google Research (evaluation_lib.py) - Licensed under Apache License 2.0
- Allen Institute for AI (instructions_registry.py, instructions_util.py) - Licensed under Apache License 2.0
We have adapted these components for finance-specific instruction evaluation while maintaining the original licensing terms.
- LiteLLM: Multi-provider LLM integration - MIT License
- NLTK: Natural language processing toolkit - Apache License 2.0
- Absl-py: Abseil Python Common Libraries - Apache License 2.0
- litellm-gateway: Custom gateway module for LLM integration
- manuscript: Research paper and documentation
All original copyrights and licenses have been preserved in the respective source files. This project is distributed under the MIT License for new contributions while respecting the licensing terms of incorporated components.