Conversation
Pull request overview
This PR refactors Predbat’s ML load forecasting (“LoadML”) training loop to support curriculum-style multi-pass training, adds inverted-dropout regularization, and updates validation/early-stopping to incorporate bias alongside MAE.
Changes:
- Added curriculum training support and updated the LoadML component’s training scheduling behavior.
- Implemented inverted dropout in the NumPy MLP and extended model metadata to persist dropout/bias.
- Updated docs and tests to reflect the new training approach and metrics.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
Summary per file:

| File | Description |
|---|---|
| docs/load-ml.md | Documents curriculum training, dropout, bias-aware early stopping, and advanced internal knobs. |
| apps/predbat/load_predictor.py | Adds inverted dropout, bias metric, curriculum training helper, and persists new metadata fields. |
| apps/predbat/load_ml_component.py | Switches training scheduling to model-age-based retraining and integrates curriculum training into the component. |
| apps/predbat/tests/test_load_ml.py | Updates tests for the new _forward() signature and adds curriculum training coverage. |
| apps/predbat/tests/test_component_base.py | Adjusts timing in async tests to match fast-sleep scaling and component loop behavior. |
| apps/predbat/component_base.py | Changes the default per-run timeout for components. |
| apps/predbat/predbat.py | Bumps Predbat version. |
```python
# Even if initial was done we need to do one fine tuned curriculum pass too.
val_mae = self.predictor.train_curriculum(
    load_data_snap,
    now_utc_snap,
    pv_minutes=pv_data_snap,
    temp_minutes=temp_data_snap,
    import_rates=import_rates_snap,
    export_rates=export_rates_snap,
    is_initial=is_initial,
    epochs=epochs,
    time_decay_days=time_decay,
    validation_holdout_hours=holdout_hours,
    patience=patience,
    curriculum_window_days=window_days,
    curriculum_step_days=step_days,
    max_intermediate_passes=max_intermediate_passes,
)
```
_do_training() calls train_curriculum() unconditionally after the initial-training branch, so initial training runs twice. It also means fine-tuning always uses train_curriculum(), whose first pass calls train(..., is_initial=True) and will refit normalization/skip the finetune optimizer reset. This is likely to regress fine-tuning behaviour and significantly increase CPU time; consider using curriculum only for initial training and calling train(..., is_initial=False) for fine-tunes.
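The suggested split can be sketched as follows. This is a minimal illustration, not Predbat's actual code: `StubPredictor` and `do_training` are hypothetical stand-ins; only the `train_curriculum`/`train` method names and the `is_initial` flag come from the snippet above.

```python
# Hypothetical stub that records which training path was taken.
class StubPredictor:
    def __init__(self):
        self.calls = []

    def train_curriculum(self, **kwargs):
        self.calls.append(("curriculum", kwargs.get("is_initial")))
        return 0.1  # pretend validation MAE

    def train(self, **kwargs):
        self.calls.append(("train", kwargs.get("is_initial")))
        return 0.1

def do_training(predictor, is_initial):
    # Curriculum only for the initial fit; a plain fine-tune afterwards so
    # normalization is not refit and the finetune optimizer reset is kept.
    if is_initial:
        return predictor.train_curriculum(is_initial=True)
    return predictor.train(is_initial=False)

p = StubPredictor()
do_training(p, is_initial=True)
do_training(p, is_initial=False)
print(p.calls)  # [('curriculum', True), ('train', False)]
```

With this gating, each `_do_training()` invocation runs exactly one training path, avoiding the double initial pass.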
```python
# to penalise both absolute error and systematic over/under-prediction
best_val_loss = baseline_mae if baseline_mae is not None else float("inf")
best_val_bias = baseline_bias  # Track actual baseline bias, not 0.0
best_combined = (0.5 * baseline_mae + abs(baseline_bias)) if baseline_mae is not None else float("inf")
```
best_combined is initialized with a different formula than the metric used later (val_mae + 0.5*|val_bias|). Currently it uses (0.5*baseline_mae + |baseline_bias|), which changes the early-stopping threshold and can prevent any epoch from being considered an improvement. Please initialize best_combined with the same combined-metric formula used in the training loop.
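For illustration, the key point is that the baseline and the per-epoch comparison must use one formula. A small sketch (the numbers are made up; only the `MAE + 0.5*|bias|` formula comes from the review above):

```python
def combined_metric(mae, bias):
    # Same formula as the training loop: MAE plus half the absolute bias.
    return mae + 0.5 * abs(bias)

# Baseline initialization should reuse the identical formula so the first
# epoch competes against the baseline on equal terms.
baseline_mae, baseline_bias = 0.40, -0.10
best_combined = combined_metric(baseline_mae, baseline_bias) if baseline_mae is not None else float("inf")
print(best_combined)  # 0.45
```

If the baseline instead used `0.5*MAE + |bias|`, a genuinely better epoch could still score above the (mismatched) threshold and never register as an improvement.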
Suggested change:
```diff
-best_combined = (0.5 * baseline_mae + abs(baseline_bias)) if baseline_mae is not None else float("inf")
+best_combined = (baseline_mae + 0.5 * abs(baseline_bias)) if baseline_mae is not None else float("inf")
```
```diff
 - **Default**: 28 days
-- **Minimum**: 1 day (not recommended for production)
 - **Recommended**: 7-28 days depending on your consumption patterns
+- **Minimum**: 7 day
```
The minimum value text should be pluralized: “7 day” → “7 days”.
Suggested change:
```diff
-- **Minimum**: 7 day
+- **Minimum**: 7 days
```
```python
print(f"    Training on real load + {data_source} PV/temperature{rates_info} with {len(load_data)} points (curriculum)...")
success = predictor.train_curriculum(
    load_data,
    now_utc,
    pv_minutes=pv_data,
    temp_minutes=temp_data,
    import_rates=import_rates_data,
    export_rates=export_rates_data,
    epochs=100,
    time_decay_days=30,
    validation_holdout_hours=24,
    patience=10,
    curriculum_window_days=7,
    curriculum_step_days=7,
    max_intermediate_passes=8,
)
success = predictor.train_curriculum(
    load_data,
    now_utc,
    pv_minutes=pv_data,
    temp_minutes=temp_data,
    import_rates=import_rates_data,
    export_rates=export_rates_data,
    epochs=30,
    time_decay_days=30,
    validation_holdout_hours=24,
    patience=10,
    curriculum_window_days=7,
    curriculum_step_days=1,
    max_intermediate_passes=4,
)
```
success is assigned twice with two back-to-back train_curriculum() calls, and the first result is overwritten. If both are meant to run, this should be explicit (e.g., separate variables / assertions) because it doubles training time; otherwise, remove the redundant call.
```diff
 self.initialize(**kwargs)
 self.count_errors = 0
-self.run_timeout = 15 * 60  # Default run time in seconds, can be overridden by subclasses
+self.run_timeout = 60 * 60  # Default run time in seconds, can be overridden by subclasses
```
Increasing the default run_timeout from 15 minutes to 1 hour reduces the safeguard against hung components and will delay recovery/stack logging on deadlocks. If a longer timeout is only needed for specific components, consider overriding run_timeout in those subclasses instead of raising the global default, or document why 1 hour is required system-wide.
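The per-component override could look like the sketch below. The class names mirror the files in this PR but the bodies are simplified illustrations, not the real implementations:

```python
class ComponentBase:
    def __init__(self):
        # Keep the tight default so hung components are detected quickly.
        self.run_timeout = 15 * 60  # 15-minute safeguard for most components

class LoadMLComponent(ComponentBase):
    def __init__(self):
        super().__init__()
        # Only the long-running ML trainer gets the extended budget.
        self.run_timeout = 60 * 60

print(ComponentBase().run_timeout, LoadMLComponent().run_timeout)  # 900 3600
```

This keeps the watchdog behaviour unchanged for every other component while still accommodating slow training runs.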
Suggested change:
```diff
-self.run_timeout = 60 * 60  # Default run time in seconds, can be overridden by subclasses
+self.run_timeout = 15 * 60  # Default run time in seconds (15 minutes), can be overridden by subclasses
```
apps/predbat/load_ml_component.py (Outdated)
```python
curriculum_step_days=7,
max_intermediate_passes=8,
```
_do_training() hard-codes curriculum_step_days=7 and max_intermediate_passes=8 for initial training, ignoring the component’s configured self.ml_curriculum_step_days / self.ml_curriculum_max_passes. This makes the behaviour diverge from the documented/configured defaults; please use the local step_days/max_intermediate_passes variables consistently.
Suggested change:
```diff
-curriculum_step_days=7,
-max_intermediate_passes=8,
+curriculum_step_days=step_days,
+max_intermediate_passes=max_intermediate_passes,
```
```diff
@@ -103,11 +103,14 @@ def __init__(self, log_func=None, learning_rate=0.001, max_load_kw=23.0, weight_
     learning_rate: Learning rate for Adam optimizer
     max_load_kw: Maximum load in kW for clipping predictions
     weight_decay: L2 regularization coefficient for AdamW (0.0 disables)
+    dropout_rate: Fraction of hidden-layer neurons to drop during training (0.0 disables);
+        inverted dropout is used so inference requires no scaling
     """
     self.log = log_func if log_func else print
     self.learning_rate = learning_rate
     self.max_load_kw = max_load_kw
     self.weight_decay = weight_decay
+    self.dropout_rate = dropout_rate
```
The constructor default dropout_rate=0.2 conflicts with the PR description/docs that describe a 10% dropout rate. If 10% is intended as the standard default, set the default to 0.1 so direct LoadPredictor() usage matches the documented behaviour.
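For reference, inverted dropout as described in the docstring above can be sketched in a few lines of NumPy. This is an illustrative standalone function, not Predbat's `_forward()` implementation:

```python
import numpy as np

def apply_inverted_dropout(activations, dropout_rate, rng, training):
    """Inverted dropout: zero a random fraction of units at train time and
    scale the survivors by 1/(1-p), so inference needs no rescaling."""
    if not training or dropout_rate <= 0.0:
        return activations
    keep_prob = 1.0 - dropout_rate
    # Mask entries are either 0 or 1/keep_prob, keeping the expected value.
    mask = (rng.random(activations.shape) < keep_prob) / keep_prob
    return activations * mask

rng = np.random.default_rng(42)
hidden = np.ones((1000, 64))
dropped = apply_inverted_dropout(hidden, 0.1, rng, training=True)
# ~10% of units are zeroed, survivors carry 1/0.9, so the mean stays ≈ 1.0
```

Because the scaling happens at train time, the inference path (`training=False`) returns activations untouched, which is why no metadata beyond the rate itself needs to be persisted for prediction.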
```python
# Slice data to the oldest 'window' minutes.
# start_minute is the more-recent edge; after shifting it becomes key 0
# so _create_dataset()'s holdout (lowest-key chunks) covers the transition
# boundary between this window and the next.
start_minute = max_minute - window
load_slice = self._slice_data_dict(load_minutes, start_minute, max_minute)
pv_slice = self._slice_data_dict(pv_minutes, start_minute, max_minute)
temp_slice = self._slice_data_dict(temp_minutes, start_minute, max_minute)
import_slice = self._slice_data_dict(import_rates, start_minute, max_minute)
export_slice = self._slice_data_dict(export_rates, start_minute, max_minute)

self.log("ML Predictor: Curriculum pass {}/{}: window={:.1f} days ({} hours of data)".format(pass_idx + 1, total_passes, window / day_minutes, window // 60))

pass_mae = self.train(
    load_slice,
    now_utc,
    pv_minutes=pv_slice,
    temp_minutes=temp_slice,
    import_rates=import_slice,
    export_rates=export_slice,
    is_initial=(pass_idx == 0),
    epochs=epochs,
    time_decay_days=time_decay_days,
    patience=patience,
    validation_holdout_hours=validation_holdout_hours,
    norm_ema_alpha=norm_ema_alpha,
)
```
Curriculum slicing re-indexes the selected window so its “most recent” edge becomes minute 0, but still passes the original now_utc into train(). Since _create_dataset() derives time-of-day/day-of-week features from now_utc - minute_offset, this will assign incorrect timestamps (especially weekday) to the sliced data. Either keep the original minute offsets (no re-index), or adjust now_utc per slice (e.g., now_utc - timedelta(minutes=start_minute)) so time features stay aligned with the underlying data.
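The per-slice `now_utc` adjustment can be illustrated like this. `slice_reference_time` is a hypothetical helper name; the `now_utc - timedelta(minutes=start_minute)` correction itself is the one suggested above:

```python
from datetime import datetime, timedelta, timezone

def slice_reference_time(now_utc, start_minute):
    # Shift the reference so minute 0 of the re-indexed slice maps back to
    # the real wall-clock time of the slice's most recent sample.
    return now_utc - timedelta(minutes=start_minute)

now_utc = datetime(2024, 6, 10, 12, 0, tzinfo=timezone.utc)  # a Monday
start_minute = 3 * 24 * 60  # a slice whose newest edge is 3 days old
print(slice_reference_time(now_utc, start_minute).strftime("%A"))  # Friday
```

Without the shift, minute 0 of that slice would be treated as Monday noon, so every day-of-week and time-of-day feature in the window would be three days off.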
apps/predbat/load_ml_component.py (Outdated)
```diff
 self.log("ML Component: Fetched {} load data points, {:.1f} days of history".format(len(load_minutes_new), days_to_fetch))
-if 0:
+if 1:
```
The debug dump is currently always enabled (if 1:) and writes ml_load_debug.json on every fetch. This can leak sensitive household usage/rate data and also adds unnecessary disk IO; please gate this behind an explicit debug flag/config option (or remove it) and default to disabled.
Suggested change:
```diff
-if 1:
+# Optional debug dump of ML load input data; controlled by config flag and disabled by default
+debug_ml_dump = False
+try:
+    # Prefer config option if available on the base PredBat instance
+    debug_ml_dump = bool(self.base.get_arg("ml_load_debug", False))
+except Exception:
+    # Fallback to attribute on this component if present
+    debug_ml_dump = bool(getattr(self, "ml_load_debug", False))
+if debug_ml_dump:
```
apps/predbat/load_ml_component.py (Outdated)
```python
retrain_age_seconds = (self.now_utc - self.last_train_time).total_seconds() if self.last_train_time else RETRAIN_INTERVAL_SECONDS
should_train = not first and (retrain_age_seconds >= RETRAIN_INTERVAL_SECONDS)

# Fetch fresh load data periodically (every 15 minutes)
```
The fetch cadence comment says “every 15 minutes”, but PREDICTION_INTERVAL_SECONDS is 30 minutes. Please update the comment to match the actual interval to avoid confusion during maintenance.
Suggested change:
```diff
-# Fetch fresh load data periodically (every 15 minutes)
+# Fetch fresh load data periodically (every 30 minutes)
```
Larger re-work: