Conversation
Pull request overview
This PR refactors Predbat’s ML load forecasting (“LoadML”) training loop to support curriculum-style multi-pass training, adds inverted-dropout regularization, and updates validation/early-stopping to incorporate bias alongside MAE.
Changes:
- Added curriculum training support and updated the LoadML component’s training scheduling behavior.
- Implemented inverted dropout in the NumPy MLP and extended model metadata to persist dropout/bias.
- Updated docs and tests to reflect the new training approach and metrics.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
Summary per file:

| File | Description |
|---|---|
| docs/load-ml.md | Documents curriculum training, dropout, bias-aware early stopping, and advanced internal knobs. |
| apps/predbat/load_predictor.py | Adds inverted dropout, bias metric, curriculum training helper, and persists new metadata fields. |
| apps/predbat/load_ml_component.py | Switches training scheduling to model-age-based retraining and integrates curriculum training into the component. |
| apps/predbat/tests/test_load_ml.py | Updates tests for the new _forward() signature and adds curriculum training coverage. |
| apps/predbat/tests/test_component_base.py | Adjusts timing in async tests to match fast-sleep scaling and component loop behavior. |
| apps/predbat/component_base.py | Changes the default per-run timeout for components. |
| apps/predbat/predbat.py | Bumps Predbat version. |
```python
# Even if initial was done we need to do one fine tuned curriculum pass too.
val_mae = self.predictor.train_curriculum(
    load_data_snap,
    now_utc_snap,
    pv_minutes=pv_data_snap,
    temp_minutes=temp_data_snap,
    import_rates=import_rates_snap,
    export_rates=export_rates_snap,
    is_initial=is_initial,
    epochs=epochs,
    time_decay_days=time_decay,
    validation_holdout_hours=holdout_hours,
    patience=patience,
    curriculum_window_days=window_days,
    curriculum_step_days=step_days,
    max_intermediate_passes=max_intermediate_passes,
)
```
_do_training() calls train_curriculum() unconditionally after the initial-training branch, so initial training runs twice. It also means fine-tuning always uses train_curriculum(), whose first pass calls train(..., is_initial=True) and will refit normalization/skip the finetune optimizer reset. This is likely to regress fine-tuning behaviour and significantly increase CPU time; consider using curriculum only for initial training and calling train(..., is_initial=False) for fine-tunes.
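The suggested split can be sketched as follows. This is a minimal illustration, not Predbat's actual code: `StubPredictor` and `do_training` are hypothetical stand-ins; only the `train_curriculum`/`train` method names and the `is_initial` flag come from the snippet above.

```python
# Hypothetical stub that records which training path was taken.
class StubPredictor:
    def __init__(self):
        self.calls = []

    def train_curriculum(self, **kwargs):
        self.calls.append(("curriculum", kwargs.get("is_initial")))
        return 0.1  # pretend validation MAE

    def train(self, **kwargs):
        self.calls.append(("train", kwargs.get("is_initial")))
        return 0.1

def do_training(predictor, is_initial):
    # Curriculum only for the initial fit; a plain fine-tune afterwards so
    # normalization is not refit and the finetune optimizer reset is kept.
    if is_initial:
        return predictor.train_curriculum(is_initial=True)
    return predictor.train(is_initial=False)

p = StubPredictor()
do_training(p, is_initial=True)
do_training(p, is_initial=False)
print(p.calls)  # [('curriculum', True), ('train', False)]
```

With this gating, each `_do_training()` invocation runs exactly one training path, avoiding the double initial pass.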
```python
# to penalise both absolute error and systematic over/under-prediction
best_val_loss = baseline_mae if baseline_mae is not None else float("inf")
best_val_bias = baseline_bias  # Track actual baseline bias, not 0.0
best_combined = (0.5 * baseline_mae + abs(baseline_bias)) if baseline_mae is not None else float("inf")
```
best_combined is initialized with a different formula than the metric used later (val_mae + 0.5*|val_bias|). Currently it uses (0.5*baseline_mae + |baseline_bias|), which changes the early-stopping threshold and can prevent any epoch from being considered an improvement. Please initialize best_combined with the same combined-metric formula used in the training loop.
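For illustration, the key point is that the baseline and the per-epoch comparison must use one formula. A small sketch (the numbers are made up; only the `MAE + 0.5*|bias|` formula comes from the review above):

```python
def combined_metric(mae, bias):
    # Same formula as the training loop: MAE plus half the absolute bias.
    return mae + 0.5 * abs(bias)

# Baseline initialization should reuse the identical formula so the first
# epoch competes against the baseline on equal terms.
baseline_mae, baseline_bias = 0.40, -0.10
best_combined = combined_metric(baseline_mae, baseline_bias) if baseline_mae is not None else float("inf")
print(best_combined)  # 0.45
```

If the baseline instead used `0.5*MAE + |bias|`, a genuinely better epoch could still score above the (mismatched) threshold and never register as an improvement.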
Suggested change:
```diff
-best_combined = (0.5 * baseline_mae + abs(baseline_bias)) if baseline_mae is not None else float("inf")
+best_combined = (baseline_mae + 0.5 * abs(baseline_bias)) if baseline_mae is not None else float("inf")
```
```diff
 - **Default**: 28 days
-- **Minimum**: 1 day (not recommended for production)
 - **Recommended**: 7-28 days depending on your consumption patterns
+- **Minimum**: 7 day
```
The minimum value text should be pluralized: “7 day” → “7 days”.
Suggested change:
```diff
-- **Minimum**: 7 day
+- **Minimum**: 7 days
```
```python
print(f"    Training on real load + {data_source} PV/temperature{rates_info} with {len(load_data)} points (curriculum)...")
success = predictor.train_curriculum(
    load_data,
    now_utc,
    pv_minutes=pv_data,
    temp_minutes=temp_data,
    import_rates=import_rates_data,
    export_rates=export_rates_data,
    epochs=100,
    time_decay_days=30,
    validation_holdout_hours=24,
    patience=10,
    curriculum_window_days=7,
    curriculum_step_days=7,
    max_intermediate_passes=8,
)
success = predictor.train_curriculum(
    load_data,
    now_utc,
    pv_minutes=pv_data,
    temp_minutes=temp_data,
    import_rates=import_rates_data,
    export_rates=export_rates_data,
    epochs=30,
    time_decay_days=30,
    validation_holdout_hours=24,
    patience=10,
    curriculum_window_days=7,
    curriculum_step_days=1,
    max_intermediate_passes=4,
)
```
success is assigned twice with two back-to-back train_curriculum() calls, and the first result is overwritten. If both are meant to run, this should be explicit (e.g., separate variables / assertions) because it doubles training time; otherwise, remove the redundant call.
```diff
 self.initialize(**kwargs)
 self.count_errors = 0
-self.run_timeout = 15 * 60  # Default run time in seconds, can be overridden by subclasses
+self.run_timeout = 60 * 60  # Default run time in seconds, can be overridden by subclasses
```
Increasing the default run_timeout from 15 minutes to 1 hour reduces the safeguard against hung components and will delay recovery/stack logging on deadlocks. If a longer timeout is only needed for specific components, consider overriding run_timeout in those subclasses instead of raising the global default, or document why 1 hour is required system-wide.
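The per-component override could look like the sketch below. The class names mirror the files in this PR but the bodies are simplified illustrations, not the real implementations:

```python
class ComponentBase:
    def __init__(self):
        # Keep the tight default so hung components are detected quickly.
        self.run_timeout = 15 * 60  # 15-minute safeguard for most components

class LoadMLComponent(ComponentBase):
    def __init__(self):
        super().__init__()
        # Only the long-running ML trainer gets the extended budget.
        self.run_timeout = 60 * 60

print(ComponentBase().run_timeout, LoadMLComponent().run_timeout)  # 900 3600
```

This keeps the watchdog behaviour unchanged for every other component while still accommodating slow training runs.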
Suggested change:
```diff
-self.run_timeout = 60 * 60  # Default run time in seconds, can be overridden by subclasses
+self.run_timeout = 15 * 60  # Default run time in seconds (15 minutes), can be overridden by subclasses
```
apps/predbat/load_ml_component.py (Outdated)
```python
curriculum_step_days=7,
max_intermediate_passes=8,
```
_do_training() hard-codes curriculum_step_days=7 and max_intermediate_passes=8 for initial training, ignoring the component’s configured self.ml_curriculum_step_days / self.ml_curriculum_max_passes. This makes the behaviour diverge from the documented/configured defaults; please use the local step_days/max_intermediate_passes variables consistently.
Suggested change:
```diff
-curriculum_step_days=7,
-max_intermediate_passes=8,
+curriculum_step_days=step_days,
+max_intermediate_passes=max_intermediate_passes,
```
```diff
@@ -103,11 +103,14 @@ def __init__(self, log_func=None, learning_rate=0.001, max_load_kw=23.0, weight_
     learning_rate: Learning rate for Adam optimizer
     max_load_kw: Maximum load in kW for clipping predictions
     weight_decay: L2 regularization coefficient for AdamW (0.0 disables)
+    dropout_rate: Fraction of hidden-layer neurons to drop during training (0.0 disables);
+        inverted dropout is used so inference requires no scaling
     """
     self.log = log_func if log_func else print
     self.learning_rate = learning_rate
     self.max_load_kw = max_load_kw
     self.weight_decay = weight_decay
+    self.dropout_rate = dropout_rate
```
The constructor default dropout_rate=0.2 conflicts with the PR description/docs that describe a 10% dropout rate. If 10% is intended as the standard default, set the default to 0.1 so direct LoadPredictor() usage matches the documented behaviour.
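For reference, inverted dropout as described in the docstring above can be sketched in a few lines of NumPy. This is an illustrative standalone function, not Predbat's `_forward()` implementation:

```python
import numpy as np

def apply_inverted_dropout(activations, dropout_rate, rng, training):
    """Inverted dropout: zero a random fraction of units at train time and
    scale the survivors by 1/(1-p), so inference needs no rescaling."""
    if not training or dropout_rate <= 0.0:
        return activations
    keep_prob = 1.0 - dropout_rate
    # Mask entries are either 0 or 1/keep_prob, keeping the expected value.
    mask = (rng.random(activations.shape) < keep_prob) / keep_prob
    return activations * mask

rng = np.random.default_rng(42)
hidden = np.ones((1000, 64))
dropped = apply_inverted_dropout(hidden, 0.1, rng, training=True)
# ~10% of units are zeroed, survivors carry 1/0.9, so the mean stays ≈ 1.0
```

Because the scaling happens at train time, the inference path (`training=False`) returns activations untouched, which is why no metadata beyond the rate itself needs to be persisted for prediction.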
```python
# Slice data to the oldest 'window' minutes.
# start_minute is the more-recent edge; after shifting it becomes key 0
# so _create_dataset()'s holdout (lowest-key chunks) covers the transition
# boundary between this window and the next.
start_minute = max_minute - window
load_slice = self._slice_data_dict(load_minutes, start_minute, max_minute)
pv_slice = self._slice_data_dict(pv_minutes, start_minute, max_minute)
temp_slice = self._slice_data_dict(temp_minutes, start_minute, max_minute)
import_slice = self._slice_data_dict(import_rates, start_minute, max_minute)
export_slice = self._slice_data_dict(export_rates, start_minute, max_minute)

self.log("ML Predictor: Curriculum pass {}/{}: window={:.1f} days ({} hours of data)".format(pass_idx + 1, total_passes, window / day_minutes, window // 60))

pass_mae = self.train(
    load_slice,
    now_utc,
    pv_minutes=pv_slice,
    temp_minutes=temp_slice,
    import_rates=import_slice,
    export_rates=export_slice,
    is_initial=(pass_idx == 0),
    epochs=epochs,
    time_decay_days=time_decay_days,
    patience=patience,
    validation_holdout_hours=validation_holdout_hours,
    norm_ema_alpha=norm_ema_alpha,
)
```
Curriculum slicing re-indexes the selected window so its “most recent” edge becomes minute 0, but still passes the original now_utc into train(). Since _create_dataset() derives time-of-day/day-of-week features from now_utc - minute_offset, this will assign incorrect timestamps (especially weekday) to the sliced data. Either keep the original minute offsets (no re-index), or adjust now_utc per slice (e.g., now_utc - timedelta(minutes=start_minute)) so time features stay aligned with the underlying data.
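The per-slice `now_utc` adjustment can be illustrated like this. `slice_reference_time` is a hypothetical helper name; the `now_utc - timedelta(minutes=start_minute)` correction itself is the one suggested above:

```python
from datetime import datetime, timedelta, timezone

def slice_reference_time(now_utc, start_minute):
    # Shift the reference so minute 0 of the re-indexed slice maps back to
    # the real wall-clock time of the slice's most recent sample.
    return now_utc - timedelta(minutes=start_minute)

now_utc = datetime(2024, 6, 10, 12, 0, tzinfo=timezone.utc)  # a Monday
start_minute = 3 * 24 * 60  # a slice whose newest edge is 3 days old
print(slice_reference_time(now_utc, start_minute).strftime("%A"))  # Friday
```

Without the shift, minute 0 of that slice would be treated as Monday noon, so every day-of-week and time-of-day feature in the window would be three days off.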
apps/predbat/load_ml_component.py (Outdated)
```diff
 self.log("ML Component: Fetched {} load data points, {:.1f} days of history".format(len(load_minutes_new), days_to_fetch))
-if 0:
+if 1:
```
The debug dump is currently always enabled (if 1:) and writes ml_load_debug.json on every fetch. This can leak sensitive household usage/rate data and also adds unnecessary disk IO; please gate this behind an explicit debug flag/config option (or remove it) and default to disabled.
Suggested change:
```diff
-if 1:
+# Optional debug dump of ML load input data; controlled by config flag and disabled by default
+debug_ml_dump = False
+try:
+    # Prefer config option if available on the base PredBat instance
+    debug_ml_dump = bool(self.base.get_arg("ml_load_debug", False))
+except Exception:
+    # Fallback to attribute on this component if present
+    debug_ml_dump = bool(getattr(self, "ml_load_debug", False))
+if debug_ml_dump:
```
apps/predbat/load_ml_component.py (Outdated)
```python
retrain_age_seconds = (self.now_utc - self.last_train_time).total_seconds() if self.last_train_time else RETRAIN_INTERVAL_SECONDS
should_train = not first and (retrain_age_seconds >= RETRAIN_INTERVAL_SECONDS)

# Fetch fresh load data periodically (every 15 minutes)
```
The fetch cadence comment says “every 15 minutes”, but PREDICTION_INTERVAL_SECONDS is 30 minutes. Please update the comment to match the actual interval to avoid confusion during maintenance.
Suggested change:
```diff
-# Fetch fresh load data periodically (every 15 minutes)
+# Fetch fresh load data periodically (every 30 minutes)
```
Larger re-work: