This repository contains the code for running a rule-based decision process to identify the appropriate verification method (field or remote) for a TerraFund project.
The primary functions of this code are located within run_decision_tree.py.
Data Gathering & Cleaning
- Query the TerraMatch API, Maxar API, and OpenTopo API to gather input data for the decision tree
- Process, validate, and clean the API response into the various "branches" of the tree
Apply Logic
- Apply the rules framework
Make Decisions
- Apply the weighted scoring approach at the polygon level
- Calculate the cost to monitor the project
- Aggregate polygon scores to derive the project score (see the sketch after this list)
Upload
- Push results to Asana
- Push results to S3
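As a concrete illustration of the "Make Decisions" stage, below is a minimal, hypothetical pandas sketch of weighted scoring at the polygon level followed by aggregation to a project score. The criterion columns, weights, and the mean aggregation are placeholders; the real weights and rules live in the rules framework (`rule_template.csv`).

```python
# Hypothetical sketch of polygon-level weighted scoring and project
# aggregation; column names, weights, and the mean aggregation are
# illustrative, not the repo's actual rules framework.
import pandas as pd

# Example polygon-level rule outcomes (1 = criterion passes, 0 = fails).
polygons = pd.DataFrame({
    "project_id": ["PRJ1", "PRJ1", "PRJ2"],
    "img_available": [1, 0, 1],
    "slope_ok": [1, 1, 0],
    "canopy_ok": [0, 1, 1],
})

# Hypothetical per-criterion weights; the real ones come from
# rule_template.csv.
weights = {"img_available": 0.5, "slope_ok": 0.3, "canopy_ok": 0.2}

# Weighted polygon score.
polygons["score"] = sum(polygons[col] * w for col, w in weights.items())

# Aggregate polygon scores up to a single project score.
project_scores = polygons.groupby("project_id")["score"].mean()
print(project_scores)
```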
- Goals: create an individual verification score for each project in the TerraFund portfolio.
- Non-Goals: score projects that are not part of TerraFund.
| Repo | Artifact Provided | Interface | Versioning/Contract | Notes |
|---|---|---|---|---|
| maxar-tools | API response | TBD | TBD | TBD |
| terramatch-researcher-api | API response | TBD | TBD | TBD |
| tf-biophysical-monitoring | patched tree cover | TBD | TBD | TBD |
| Source | Type (API/S3/etc.) | Endpoint/Path | Auth | Data Format | Schema Link |
|---|---|---|---|---|---|
| OpenTopography | API | TBD | API key | GeoTIFF | TBD |
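As one example of the data-gathering step, here is a hedged sketch of requesting a DEM from OpenTopography's global DEM endpoint using the API key; verify the endpoint and parameter names against the current OpenTopography API docs before relying on them.

```python
# Hypothetical OpenTopography request; confirm the endpoint and
# parameters against the official API documentation.
import requests

OPENTOPO_URL = "https://portal.opentopography.org/API/globaldem"

def fetch_dem(bounds: dict, api_key: str, out_path: str) -> None:
    """Download a GeoTIFF DEM clipped to a bounding box (illustrative)."""
    params = {
        "demtype": "SRTMGL1",        # one of several available DEM types
        "south": bounds["south"],
        "north": bounds["north"],
        "west": bounds["west"],
        "east": bounds["east"],
        "outputFormat": "GTiff",
        "API_Key": api_key,
    }
    resp = requests.get(OPENTOPO_URL, params=params, timeout=120)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)
```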
- Mode dependent: mainly defers to the `params.yaml` file or a CSV for processing
- Schema checks: `process_api_results.py` checks dtypes, missing or incomplete data, and duplicates
- Range checks: None
- Null/duplicate handling: `process_api_results.py` checks dtypes, missing or incomplete data, and duplicates
- Failure behavior (fail fast vs warn): None
- Upstream repos: `tf-biophysical-indicators`, `terramatch-researcher-api`, `maxar-tools`
- External data sources (S3/API/DB): OpenTopo API
- Expected input format(s) (e.g., Parquet, GeoTIFF, JSON):
- Input validation logic (schema checks, nulls, ranges): see the sketch below
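To make the dtype/null/duplicate checks above concrete, a minimal illustrative pass is sketched below; the authoritative implementation is `process_api_results.py`, and the expected columns here are placeholders.

```python
# Illustrative validation mirroring the checks described above; the
# real logic lives in process_api_results.py. Column names are
# placeholders.
import pandas as pd

EXPECTED_DTYPES = {"project_id": "object", "poly_id": "object"}

def validate_api_results(df: pd.DataFrame) -> pd.DataFrame:
    # Dtype checks: fail on missing columns, coerce mismatched dtypes.
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns:
            raise ValueError(f"missing expected column: {col}")
        if str(df[col].dtype) != dtype:
            df[col] = df[col].astype(dtype)

    # Missing/incomplete data: warn on rows with null key fields.
    null_rows = df[df[list(EXPECTED_DTYPES)].isna().any(axis=1)]
    if not null_rows.empty:
        print(f"warning: {len(null_rows)} rows with missing key fields")

    # Duplicates: drop exact duplicate rows, keeping the first.
    return df.drop_duplicates()
```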
| Artifact | Format | Path/Location | Update Frequency | Size | Retention |
|---|---|---|---|---|---|
| project ids | csv | data/portfolio_csvs/prj_ids_{cohort}_{data_version}.csv | N/A | x | Permanent |
| tm API response | json | data/tm_api_response/{cohort}_{data_version}.json | N/A | x | Permanent |
| tm clean response | csv | data/feats/tm_api_{cohort}_{data_version}.csv | N/A | x | Permanent |
| project polygon | geojson | s3://tree_verification/output/project_data/SHORTNAME/geojson/SHORTNAME_{date}.geojson | N/A | x | Permanent |
| img metadata | csv | data/imagery_availability/comb_img_availability_{cohort}_{data_version}.csv | N/A | x | Permanent |
| slope statistics | csv | data/slope/slope_statistics_{cohort}_{data_version}.csv | N/A | x | Permanent |
| decision scores | csv | data/decision_scores/prj_output_{cohort}_{data_version}_{experiment_id}.csv | N/A | x | Permanent |
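Since every artifact path above is templated on cohort, data version, and (for scores) experiment id, a small helper can keep the paths consistent. This is a sketch assuming the templates in the table are authoritative; the repo may construct these paths differently.

```python
# Hypothetical path helpers matching the artifact templates above.
from pathlib import Path

def slope_stats_path(cohort: str, data_version: str) -> Path:
    return Path("data/slope") / f"slope_statistics_{cohort}_{data_version}.csv"

def decision_scores_path(cohort: str, data_version: str, experiment_id: str) -> Path:
    return Path("data/decision_scores") / (
        f"prj_output_{cohort}_{data_version}_{experiment_id}.csv"
    )
```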
| Repo | Consumed Artifact | Contract Type | Dependency Level (strong/soft) |
|---|---|---|---|
| tree-verification | binary decision | TBD | strong |
| maxar-tools (image ordering) | binary decision | TBD | strong |
| maxar-tools (image availability) | image count | TBD | soft |
- Schema stability: None
- Freshness SLO: None
- Completeness / QA checks before publish: None
- Artifacts produced (files, tables, models): see the artifact table above
Config files: `params.yaml`, `rule_template.csv`

```yaml
criteria:
  canopy_threshold: 60
  cloud_thresh: 50
  off_nadir: 30
  sun_elevation: 30
  img_count: 1
  baseline_range: [-365, 0]
  ev_range: [730, 1095]
  drop_missing: False
  slope_thresh: 20
rules: data/rule_template.csv
```
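A minimal sketch of loading these criteria and applying one of them (assumes PyYAML; key names match the snippet above, but how the repo actually consumes `params.yaml` may differ):

```python
# Minimal sketch of consuming params.yaml; assumes PyYAML is installed.
import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)

criteria = params["criteria"]

# Example: a polygon passes the slope criterion if its mean slope does
# not exceed slope_thresh (20 above). Illustrative only.
def slope_passes(mean_slope: float) -> bool:
    return mean_slope <= criteria["slope_thresh"]
```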
| Key | Type | Default | Required | Description |
|---|---|---|---|---|
| criteria.canopy_threshold | int | 60 | TBD | TBD |
| criteria.cloud_thresh | int | 50 | TBD | TBD |
| criteria.off_nadir | int | 30 | TBD | TBD |
| criteria.sun_elevation | int | 30 | TBD | TBD |
| criteria.img_count | int | 1 | TBD | TBD |
| criteria.baseline_range | list[int] | [-365, 0] | TBD | TBD |
| criteria.ev_range | list[int] | [730, 1095] | TBD | TBD |
| criteria.drop_missing | bool | False | TBD | TBD |
| criteria.slope_thresh | int | 20 | TBD | TBD |
| rules | path | data/rule_template.csv | TBD | TBD |
Config files: `secrets.yaml`

```yaml
access_token:
opentopo_key:
aws:
  aws_access_key_id:
  aws_secret_access_key:
  aws_region:
asana:
  pat:
```

| Command | Purpose |
|---|---|
| python3 run_decision_tree.py | runs pipeline |
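A hedged sketch of how `secrets.yaml` might be loaded and used to open an S3 session (assumes PyYAML and boto3, and that the nesting matches the snippet above):

```python
# Illustrative secrets handling; assumes PyYAML and boto3 are installed
# and that secrets.yaml follows the structure shown above.
import boto3
import yaml

with open("secrets.yaml") as f:
    secrets = yaml.safe_load(f)

session = boto3.session.Session(
    aws_access_key_id=secrets["aws"]["aws_access_key_id"],
    aws_secret_access_key=secrets["aws"]["aws_secret_access_key"],
    region_name=secrets["aws"]["aws_region"],
)
s3 = session.client("s3")  # e.g., for pushing results to S3
```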
There is currently no orchestration (the pipeline is triggered manually via the CLI), but it must be run after the polygons have been approved on TerraMatch and the tree cover indicator has been calculated.
| System | Trigger | Frequency | Dependency Triggered? |
|---|---|---|---|
| N/A | CLI | Manual | No |
- Expected runtime: TBD
- GPU/CPU requirements:
- Storage footprint: ~1GB
- Cost control levers (batch size, concurrency): N/A
- Entry point(s): `run_decision_tree.py`
- CLI usage: `python3 run_decision_tree.py`
- Schedulers / Orchestrators (Airflow / Cron / GitHub Actions): None; manual CLI trigger
- Runtime + cost notes: expected runtime TBD; storage footprint ~1GB
A `requirements.txt` can be created when the time comes.
The logging for this repository should ideally create a .jsonl file within each SHORTNAME directory specifying success / failure.
- Format: currently only logging via print statements, captured to a text file via `python3 main.py > tmp_log.txt`
- Logging levels: singular
- Sample log message:
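The aspirational per-SHORTNAME .jsonl status log described above could look like the following sketch; the file name, record fields, and helper function are all hypothetical.

```python
# Hypothetical JSONL status logger: one record per run appended under
# each project's SHORTNAME directory. Names and fields are illustrative.
import json
from datetime import datetime, timezone
from pathlib import Path

def log_run_status(shortname: str, success: bool, detail: str = "") -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "project": shortname,
        "status": "success" if success else "failure",
        "detail": detail,
    }
    log_path = Path(shortname) / "run_log.jsonl"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a") as f:
        f.write(json.dumps(record) + "\n")
```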
| Metric | Type | Labels | Description |
|---|---|---|---|
- Links
- Alerting rules
- Logging (format, sinks):
Failure Cases
- Where checkpoints are stored:
- Resume instructions:
- Common failure scenarios:
- Engineering owner: Jessica Ertel
- Data owner
- On-call/rotation (if one exists)
- Code access
- Data bucket access
- Secrets access
<Upstream repo/data> → This repo → <Downstream repo/data>
- Sync frequency with upstream/downstream repos
- Dependency risks