LoadML improvements #3441

Open
springfall2008 wants to merge 4 commits into main from loadml1

Conversation


springfall2008 (Owner) commented Feb 26, 2026

Larger re-work:

  • Added curriculum training for multiple passes
  • Added a network dropout rate of 10%
  • Changed the validation metric to include the total difference for the day, not just MAE
  • Moved initial training out of component startup; fine-tuning now runs based on model age
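The multi-pass curriculum idea above can be sketched as a window schedule: each pass trains on a longer slice of history than the last, with the pass count capped. This is a minimal illustration with hypothetical names (`curriculum_windows` is not Predbat's actual API); the real defaults come from the component's configuration.

```python
def curriculum_windows(history_days, window_days=7, step_days=7, max_passes=8):
    """Plan growing training windows for multi-pass curriculum training.

    Intermediate windows grow by step_days, capped at max_passes;
    the final pass always covers the full history.
    """
    windows = []
    current = window_days
    while current < history_days and len(windows) < max_passes:
        windows.append(current)
        current += step_days
    windows.append(history_days)  # final pass sees everything
    return windows
```

With 28 days of history and the sketch's defaults this yields four passes of 7, 14, 21, and 28 days.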

Copilot AI review requested due to automatic review settings February 26, 2026 09:06

Copilot AI left a comment


Pull request overview

This PR refactors Predbat’s ML load forecasting (“LoadML”) training loop to support curriculum-style multi-pass training, adds inverted-dropout regularization, and updates validation/early-stopping to incorporate bias alongside MAE.
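Inverted dropout, mentioned in the overview, scales surviving activations by 1/(1-rate) at train time so that inference needs no rescaling. A minimal NumPy sketch (hypothetical helper, not the PR's actual MLP code):

```python
import numpy as np

def inverted_dropout(activations, rate, rng, training=True):
    """Zero a fraction `rate` of units and scale survivors by 1/(1-rate)
    during training, so the expected activation is unchanged and
    inference can use the network as-is with no scaling."""
    if not training or rate <= 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = (rng.random(activations.shape) < keep_prob) / keep_prob
    return activations * mask
```

At `training=False` the input passes through untouched, which is the whole point of the inverted formulation.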

Changes:

  • Added curriculum training support and updated the LoadML component’s training scheduling behavior.
  • Implemented inverted dropout in the NumPy MLP and extended model metadata to persist dropout/bias.
  • Updated docs and tests to reflect the new training approach and metrics.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.

File Description
docs/load-ml.md Documents curriculum training, dropout, bias-aware early stopping, and advanced internal knobs.
apps/predbat/load_predictor.py Adds inverted dropout, bias metric, curriculum training helper, and persists new metadata fields.
apps/predbat/load_ml_component.py Switches training scheduling to model-age-based retraining and integrates curriculum training into the component.
apps/predbat/tests/test_load_ml.py Updates tests for the new _forward() signature and adds curriculum training coverage.
apps/predbat/tests/test_component_base.py Adjusts timing in async tests to match fast-sleep scaling and component loop behavior.
apps/predbat/component_base.py Changes the default per-run timeout for components.
apps/predbat/predbat.py Bumps Predbat version.

Comment on lines +900 to 915
# Even if initial was done we need to do one fine tuned curriculum pass too.
val_mae = self.predictor.train_curriculum(
load_data_snap,
now_utc_snap,
pv_minutes=pv_data_snap,
temp_minutes=temp_data_snap,
import_rates=import_rates_snap,
export_rates=export_rates_snap,
is_initial=is_initial,
epochs=epochs,
time_decay_days=time_decay,
validation_holdout_hours=holdout_hours,
patience=patience,
curriculum_window_days=window_days,
curriculum_step_days=step_days,
max_intermediate_passes=max_intermediate_passes,
)

Copilot AI Feb 26, 2026


_do_training() calls train_curriculum() unconditionally after the initial-training branch, so initial training runs twice. It also means fine-tuning always uses train_curriculum(), whose first pass calls train(..., is_initial=True) and will refit normalization/skip the finetune optimizer reset. This is likely to regress fine-tuning behaviour and significantly increase CPU time; consider using curriculum only for initial training and calling train(..., is_initial=False) for fine-tunes.

# to penalise both absolute error and systematic over/under-prediction
best_val_loss = baseline_mae if baseline_mae is not None else float("inf")
best_val_bias = baseline_bias # Track actual baseline bias, not 0.0
best_combined = (0.5 * baseline_mae + abs(baseline_bias)) if baseline_mae is not None else float("inf")

Copilot AI Feb 26, 2026


best_combined is initialized with a different formula than the metric used later (val_mae + 0.5*|val_bias|). Currently it uses (0.5*baseline_mae + |baseline_bias|), which changes the early-stopping threshold and can prevent any epoch from being considered an improvement. Please initialize best_combined with the same combined-metric formula used in the training loop.

Suggested change
best_combined = (0.5 * baseline_mae + abs(baseline_bias)) if baseline_mae is not None else float("inf")
best_combined = (baseline_mae + 0.5 * abs(baseline_bias)) if baseline_mae is not None else float("inf")
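The underlying rule is that the initial best score and the per-epoch score must use one and the same formula; factoring the metric into a helper makes that hard to get wrong. A sketch with hypothetical names:

```python
def combined_metric(mae, bias):
    """Early-stopping score: full weight on absolute error, half weight
    on systematic over/under-prediction (mae + 0.5 * |bias|)."""
    return mae + 0.5 * abs(bias)

def init_best_combined(baseline_mae, baseline_bias):
    """Seed the best score with the SAME formula the epoch loop uses,
    so an epoch that beats the baseline is actually accepted."""
    if baseline_mae is None:
        return float("inf")
    return combined_metric(baseline_mae, baseline_bias)
```

Any epoch is then an improvement exactly when `combined_metric(val_mae, val_bias)` drops below the seeded value.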

- **Default**: 28 days
- **Minimum**: 1 day (not recommended for production)
- **Recommended**: 7-28 days depending on your consumption patterns
- **Minimum**: 7 day

Copilot AI Feb 26, 2026


The minimum value text should be pluralized: “7 day” → “7 days”.

Suggested change
- **Minimum**: 7 day
- **Minimum**: 7 days

Comment on lines 1218 to 1248
print(f" Training on real load + {data_source} PV/temperature{rates_info} with {len(load_data)} points (curriculum)...")
success = predictor.train_curriculum(
load_data,
now_utc,
pv_minutes=pv_data,
temp_minutes=temp_data,
import_rates=import_rates_data,
export_rates=export_rates_data,
epochs=100,
time_decay_days=30,
validation_holdout_hours=24,
patience=10,
curriculum_window_days=7,
curriculum_step_days=7,
max_intermediate_passes=8,
)
success = predictor.train_curriculum(
load_data,
now_utc,
pv_minutes=pv_data,
temp_minutes=temp_data,
import_rates=import_rates_data,
export_rates=export_rates_data,
epochs=30,
time_decay_days=30,
validation_holdout_hours=24,
patience=10,
curriculum_window_days=7,
curriculum_step_days=1,
max_intermediate_passes=4,
)

Copilot AI Feb 26, 2026


success is assigned twice with two back-to-back train_curriculum() calls, and the first result is overwritten. If both are meant to run, this should be explicit (e.g., separate variables / assertions) because it doubles training time; otherwise, remove the redundant call.
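If both passes (a coarse then a fine curriculum) are intentional, the reviewer's point is to make each result observable rather than silently overwriting the first. A sketch with a hypothetical wrapper:

```python
def run_both_passes(train_curriculum):
    """Run a coarse pass then a fine refinement pass, checking each
    result explicitly instead of reusing one `success` variable."""
    coarse_ok = train_curriculum(curriculum_step_days=7,
                                 max_intermediate_passes=8, epochs=100)
    assert coarse_ok, "coarse curriculum pass failed"
    fine_ok = train_curriculum(curriculum_step_days=1,
                               max_intermediate_passes=4, epochs=30)
    assert fine_ok, "fine refinement pass failed"
    return fine_ok
```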

self.initialize(**kwargs)
self.count_errors = 0
self.run_timeout = 15 * 60 # Default run time in seconds, can be overridden by subclasses
self.run_timeout = 60 * 60 # Default run time in seconds, can be overridden by subclasses

Copilot AI Feb 26, 2026


Increasing the default run_timeout from 15 minutes to 1 hour reduces the safeguard against hung components and will delay recovery/stack logging on deadlocks. If a longer timeout is only needed for specific components, consider overriding run_timeout in those subclasses instead of raising the global default, or document why 1 hour is required system-wide.
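The per-subclass override the reviewer suggests is straightforward: keep the conservative global default and widen the watchdog only where long runs are legitimate. A minimal sketch (class names simplified from the actual component hierarchy):

```python
class ComponentBase:
    def __init__(self):
        # Conservative global default so hung components are caught quickly.
        self.run_timeout = 15 * 60  # seconds

class LoadMLComponent(ComponentBase):
    def __init__(self):
        super().__init__()
        # ML training legitimately runs long; widen the watchdog here only.
        self.run_timeout = 60 * 60  # seconds
```

Every other component keeps the 15-minute safeguard and its fast deadlock logging.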

Suggested change
self.run_timeout = 60 * 60 # Default run time in seconds, can be overridden by subclasses
self.run_timeout = 15 * 60 # Default run time in seconds (15 minutes), can be overridden by subclasses

Comment on lines 897 to 898
curriculum_step_days=7,
max_intermediate_passes=8,

Copilot AI Feb 26, 2026


_do_training() hard-codes curriculum_step_days=7 and max_intermediate_passes=8 for initial training, ignoring the component’s configured self.ml_curriculum_step_days / self.ml_curriculum_max_passes. This makes the behaviour diverge from the documented/configured defaults; please use the local step_days/max_intermediate_passes variables consistently.

Suggested change
curriculum_step_days=7,
max_intermediate_passes=8,
curriculum_step_days=step_days,
max_intermediate_passes=max_intermediate_passes,

Comment on lines 97 to 113
@@ -103,11 +103,14 @@ def __init__(self, log_func=None, learning_rate=0.001, max_load_kw=23.0, weight_
learning_rate: Learning rate for Adam optimizer
max_load_kw: Maximum load in kW for clipping predictions
weight_decay: L2 regularization coefficient for AdamW (0.0 disables)
dropout_rate: Fraction of hidden-layer neurons to drop during training (0.0 disables);
inverted dropout is used so inference requires no scaling
"""
self.log = log_func if log_func else print
self.learning_rate = learning_rate
self.max_load_kw = max_load_kw
self.weight_decay = weight_decay
self.dropout_rate = dropout_rate

Copilot AI Feb 26, 2026


The constructor default dropout_rate=0.2 conflicts with the PR description/docs that describe a 10% dropout rate. If 10% is intended as the standard default, set the default to 0.1 so direct LoadPredictor() usage matches the documented behaviour.

Comment on lines 1151 to 1177
# Slice data to the oldest 'window' minutes.
# start_minute is the more-recent edge; after shifting it becomes key 0
# so _create_dataset()'s holdout (lowest-key chunks) covers the transition
# boundary between this window and the next.
start_minute = max_minute - window
load_slice = self._slice_data_dict(load_minutes, start_minute, max_minute)
pv_slice = self._slice_data_dict(pv_minutes, start_minute, max_minute)
temp_slice = self._slice_data_dict(temp_minutes, start_minute, max_minute)
import_slice = self._slice_data_dict(import_rates, start_minute, max_minute)
export_slice = self._slice_data_dict(export_rates, start_minute, max_minute)

self.log("ML Predictor: Curriculum pass {}/{}: window={:.1f} days ({} hours of data)".format(pass_idx + 1, total_passes, window / day_minutes, window // 60))

pass_mae = self.train(
load_slice,
now_utc,
pv_minutes=pv_slice,
temp_minutes=temp_slice,
import_rates=import_slice,
export_rates=export_slice,
is_initial=(pass_idx == 0),
epochs=epochs,
time_decay_days=time_decay_days,
patience=patience,
validation_holdout_hours=validation_holdout_hours,
norm_ema_alpha=norm_ema_alpha,
)

Copilot AI Feb 26, 2026


Curriculum slicing re-indexes the selected window so its “most recent” edge becomes minute 0, but still passes the original now_utc into train(). Since _create_dataset() derives time-of-day/day-of-week features from now_utc - minute_offset, this will assign incorrect timestamps (especially weekday) to the sliced data. Either keep the original minute offsets (no re-index), or adjust now_utc per slice (e.g., now_utc - timedelta(minutes=start_minute)) so time features stay aligned with the underlying data.
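The second fix the reviewer offers (shifting `now_utc` per slice) amounts to moving the reference timestamp back by the same offset used to re-index the window, so features derived as `now_utc - minute_offset` stay aligned. A sketch with a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone

def slice_reference_time(now_utc, start_minute):
    """When a window is re-indexed so its most recent sample becomes
    minute 0, shift the reference timestamp by the same offset so
    time-of-day and day-of-week features derived from
    now_utc - minute_offset still match the underlying data."""
    return now_utc - timedelta(minutes=start_minute)
```

For example, a slice starting 7 days back gets a reference time exactly one week earlier, so every derived weekday is unchanged from the original data's.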


self.log("ML Component: Fetched {} load data points, {:.1f} days of history".format(len(load_minutes_new), days_to_fetch))
if 0:
if 1:

Copilot AI Feb 26, 2026


The debug dump is currently always enabled (if 1:) and writes ml_load_debug.json on every fetch. This can leak sensitive household usage/rate data and also adds unnecessary disk IO; please gate this behind an explicit debug flag/config option (or remove it) and default to disabled.

Suggested change
if 1:
# Optional debug dump of ML load input data; controlled by config flag and disabled by default
debug_ml_dump = False
try:
# Prefer config option if available on the base PredBat instance
debug_ml_dump = bool(self.base.get_arg("ml_load_debug", False))
except Exception:
# Fallback to attribute on this component if present
debug_ml_dump = bool(getattr(self, "ml_load_debug", False))
if debug_ml_dump:

retrain_age_seconds = (self.now_utc - self.last_train_time).total_seconds() if self.last_train_time else RETRAIN_INTERVAL_SECONDS
should_train = not first and (retrain_age_seconds >= RETRAIN_INTERVAL_SECONDS)

# Fetch fresh load data periodically (every 15 minutes)

Copilot AI Feb 26, 2026


The fetch cadence comment says “every 15 minutes”, but PREDICTION_INTERVAL_SECONDS is 30 minutes. Please update the comment to match the actual interval to avoid confusion during maintenance.

Suggested change
# Fetch fresh load data periodically (every 15 minutes)
# Fetch fresh load data periodically (every 30 minutes)
