Hyperparameter Optimization
This chapter explains how to run HPO, how to design a good search space, and — most importantly — why the choices made here work.
The HPO Philosophy
The temptation in HPO is to tune everything. Resist it.
The guiding principle here is: fix capacity, search expressiveness. In the default MLP architecture, `nodes` controls how wide the network is — that is its capacity to store information. `layers` controls how many nonlinear transformations it applies — that is its expressiveness. These two parameters have very different roles.
Fixing nodes and tuning layers makes HPO trials comparable to each other: every trial has the same number of parameters per layer, so differences in validation loss are due to depth (expressiveness), not raw parameter count. If you tune both simultaneously, a shallow-but-wide network and a deep-but-narrow network can reach the same loss for completely different reasons, and the optimizer cannot learn a useful signal about either dimension.
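To make the fixed-width argument concrete, here is a small sketch of how parameter count scales with depth when `nodes` is fixed. The `mlp_param_count` helper and the input/output dimensions are hypothetical; the actual `model.MLP` layout may differ.

```python
# Illustrative parameter count for an MLP with fixed width (`nodes`) and
# varying depth (`layers`). Assumed layout: input -> `layers` hidden layers
# of width `nodes` -> output, all fully connected with biases.
def mlp_param_count(in_dim, nodes, layers, out_dim):
    dims = [in_dim] + [nodes] * layers + [out_dim]
    # weights (d_in * d_out) plus biases (d_out) for each linear layer
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

# With nodes fixed at 64, every extra layer adds the same 64*64 + 64
# parameters, so HPO trials differ only in depth, not per-layer capacity.
for layers in range(3, 7):
    print(layers, mlp_param_count(in_dim=1, nodes=64, layers=layers, out_dim=1))
```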
The same reasoning applies to `total_steps` and `upper_bound` in the scheduler: these define the shape of the learning rate schedule and should remain fixed relative to the training duration. Let HPO search for the right learning rate magnitude and floor (`lr`, `min_lr`) rather than the schedule geometry.
Rule of thumb: tune at most 3–5 parameters per HPO run. Tuning more than that causes a combinatorial explosion — you need exponentially more trials to cover the space.
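A quick back-of-the-envelope calculation shows why: with k candidate values per parameter, grid-like coverage of d parameters needs on the order of k^d trials.

```python
# Grid-coverage intuition behind the 3-5 parameter rule of thumb:
# with k candidate values per parameter, covering d parameters
# takes roughly k**d trials.
k = 5
for d in (3, 5, 8, 10):
    print(d, k ** d)
# 3 parameters -> 125 trials is feasible; 10 parameters -> ~9.8M is not.
```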
Search Space Design
The search space is defined in a separate YAML file (e.g., configs/optimize_template.yaml). It is grouped by config category (net_config, optimizer_config, scheduler_config) and each parameter has a type and bounds.
Supported types:
| Type | Required fields | Optional fields | Notes |
|---|---|---|---|
| int | min, max | step | Discrete integer range |
| float | min, max | log: true | Continuous; use log for orders-of-magnitude ranges |
| categorical | choices | — | Explicit list; use for architecture choices |
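As an illustration of how these three types could map onto Optuna's suggest API (this is a sketch, not the template's actual config loader; `suggest_param` is a hypothetical helper):

```python
# Sketch: mapping a search-space spec dict onto Optuna's suggest API.
def suggest_param(trial, name, spec):
    if spec["type"] == "int":
        # `step` defaults to 1 when omitted, matching the table above
        return trial.suggest_int(name, spec["min"], spec["max"],
                                 step=spec.get("step", 1))
    if spec["type"] == "float":
        # `log: true` samples uniformly in log space
        return trial.suggest_float(name, spec["min"], spec["max"],
                                   log=spec.get("log", False))
    if spec["type"] == "categorical":
        return trial.suggest_categorical(name, spec["choices"])
    raise ValueError(f"unknown search-space type: {spec['type']!r}")
```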
Here is a complete, well-designed search space for the SPlus + ExpHyperbolicLR combination (HPO run config):
```yaml
# configs/hpo.yaml
study_name: MyProject_HPO
trials: 50
seed: 42
metric: val_loss
direction: minimize

sampler:
  name: optuna.samplers.TPESampler

pruner:
  name: pruner.PFLPruner
  kwargs:
    n_startup_trials: 10
    n_warmup_epochs: 10
    top_k: 10
    target_epoch: 10  # match HPO epochs

search_space:
  net_config:
    layers:
      type: int
      min: 3
      max: 6
  optimizer_config:
    lr:
      type: float
      min: 1.e-3
      max: 1.e+0
      log: true
  scheduler_config:
    min_lr:
      type: float
      min: 1.e-7
      max: 1.e-3
      log: true
```
What to tune, and what not to:
- `layers`: tune. Depth has a large effect on whether the network can capture the target function’s complexity.
- `lr`: tune, with a wide log-scale range. SPlus has internal eigenvalue-based scaling (see next section), so `1e-3` to `1e+0` is the correct range — not `1e-5` to `1e-2`.
- `min_lr`: tune. The learning rate floor determines how much fine-tuning the scheduler provides at convergence.
- `nodes`: do not tune. Fix it based on your memory budget and desired model size.
- `upper_bound`: do not tune. This is a schedule geometry parameter. Set it once based on how quickly you expect the model to converge.
- `total_steps`: do not tune. Fix it to match the number of HPO epochs (10).
SPlus + ExpHyperbolicLR: Why 10 Epochs Is Enough
This section addresses the most important practical insight in this template.
SPlus: wide lr range is correct
SPlus (from pytorch_optimizer) is a second-order-inspired optimizer that applies an internal eigenvalue-based scaling to each parameter update. The effective step size is much smaller than the nominal `lr` you pass to it. An `lr=0.1` with SPlus behaves more like `lr=1e-4` with Adam in terms of actual parameter movement.
This is why the HPO range for lr must be [1e-3, 1e+0] (log scale). If you narrow it to [1e-5, 1e-2] — a reasonable range for Adam — you will miss the effective region entirely and HPO will return a suboptimal result. Do not change this range.
ExpHyperbolicLR: epoch-insensitive ordering
ExpHyperbolicLR (from pytorch_scheduler) has an unusual property: the relative ordering of two configurations by validation loss after 10 epochs is the same as their ordering after 150 epochs, provided the schedule shapes are comparable.
This happens because the scheduler’s decay is hyperbolic — it drops quickly early and then flattens. The first 10 epochs capture the important part of the loss descent. After that, additional epochs refine the result but do not change which configuration wins.
This means you can set epochs: 10 in the HPO run config and get reliable results that transfer to the full 150-epoch run.
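To illustrate the "fast early drop, flat tail" shape, here is a stand-in curve with the same qualitative behavior. This is not pytorch_scheduler's exact ExpHyperbolicLR formula — the functional form below is an assumption for illustration only.

```python
import math

# Illustrative hyperbolic-style decay, NOT the library's exact formula.
# The exponent sqrt(x * (2 - x)) rises steeply near x = 0 and flattens
# toward 1 as x -> 1, which reproduces the qualitative shape described
# above: most of the drop happens early, then the curve flattens.
def illustrative_lr(t, lr0, min_lr, upper_bound):
    x = t / upper_bound
    return lr0 * (min_lr / lr0) ** math.sqrt(x * (2 - x))

lr0, min_lr, U = 3.4e-2, 1.2e-6, 250
for t in (0, 10, 50, 150, 250):
    print(t, illustrative_lr(t, lr0, min_lr, U))
```

Because the steep part of the curve lies in the earliest epochs, the ranking of configurations at epoch 10 already reflects the schedule's dominant behavior.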
The critical detail: set `total_steps: 10` (matching HPO epochs) in the HPO config, but `upper_bound` stays fixed. The `target_epoch` in PFLPruner should also match HPO epochs. Here is how the two configs relate:
| Parameter | HPO run config | Best run config |
|---|---|---|
| `epochs` | 10 | 150 |
| `scheduler_config.total_steps` | 10 | 150 |
| `scheduler_config.upper_bound` | (fixed, e.g. 250) | same |
| `seeds` | [42] | [58, 89, 231, 928, 814] |
The upper_bound stays the same in both runs because it controls the asymptotic behavior of the schedule, not the number of steps.
PFL Pruner
Running every HPO trial to completion is wasteful. Most bad configurations reveal themselves early. The PFLPruner (Predicted Final Loss Pruner) exploits this by predicting where a trial will end up and stopping it early if the prediction is poor.
How it works
- Startup phase: the first `n_startup_trials` trials run to completion unconditionally. These populate the “top-K” reference set.
- Warmup phase: within each trial, the first `n_warmup_epochs` epochs are never pruned. This gives the loss-curve fit enough data points.
- Prediction: after the warmup period, PFLPruner fits an exponential decay $L(t) = A e^{Kt}$ to the validation loss history using `numpy.polyfit` on the log-transformed losses. It then extrapolates to `target_epoch` to get a Predicted Final Loss (PFL).
- Pruning decision: if the current trial’s PFL is worse than the worst PFL in the top-K set, the trial is pruned.
The PFL is computed as $-\log_{10}(L_{\text{predicted}})$, so higher PFL is better (lower predicted loss). A trial is pruned if its PFL falls below the minimum PFL of the top-K reference trials.
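The fit-and-extrapolate step can be sketched as follows. This mirrors the description above (exponential fit via `numpy.polyfit` on log losses, extrapolation to `target_epoch`, PFL as negative log10), but the real PFLPruner may handle edge cases differently.

```python
import numpy as np

# Fit log-losses linearly (equivalent to L(t) = A * exp(K * t)),
# extrapolate to target_epoch, and report PFL = -log10(predicted loss).
def predicted_final_loss(val_losses, target_epoch):
    epochs = np.arange(1, len(val_losses) + 1)
    # polyfit on log-losses: log L = K * t + log A
    K, logA = np.polyfit(epochs, np.log(val_losses), deg=1)
    predicted = np.exp(logA + K * target_epoch)
    return -np.log10(predicted)  # higher PFL = lower predicted loss

# A trial decaying fast scores a higher (better) PFL than one stalling:
good = [1e-1 * np.exp(-0.7 * t) for t in range(1, 6)]
bad = [1e-1 * np.exp(-0.2 * t) for t in range(1, 6)]
print(predicted_final_loss(good, target_epoch=10))  # larger PFL
print(predicted_final_loss(bad, target_epoch=10))   # smaller PFL
```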
The result: approximately 40% GPU time savings compared to running all trials to completion, with no loss in the quality of the best trial found.
Configuration parameters
```yaml
pruner:
  name: pruner.PFLPruner
  kwargs:
    n_startup_trials: 10  # Run these unconditionally to build a reference
    n_warmup_epochs: 10   # Within each trial, never prune before this epoch
    top_k: 10             # Size of the reference set
    target_epoch: 10      # Predict loss at this epoch (match HPO epochs)
```
Setting target_epoch to match epochs in the run config ensures the pruner predicts to the actual end of training, not beyond it.
Reading HPO Results with hpo-report
After the HPO run completes, a SQLite database file is created (e.g., MyProject_Opt.db). Use hpo-report to analyze it:
```sh
# Auto-detect the .db file if only one exists
python cli.py hpo-report

# Explicitly specify database and optimization config for boundary warnings
python cli.py hpo-report --db MyProject_Opt.db --opt-config configs/hpo.yaml --top-k 5
```
Example output:
```
Study: MyProject_HPO (MyProject_Opt.db)
Trials: 50 total, 28 completed, 19 pruned, 3 failed

Best Trial #17
  Value: 0.000312
  Group: MLP_n_64_l_4_SP_l_3.4210e-02_EHLRS_t_10_u_250_m_1.2345e-06[17]
  net_config_layers: 4
  optimizer_config_lr: 0.03421
  scheduler_config_min_lr: 1.23e-06

┌──────────────────────────────────────────────────────┐
│ Parameter Importance                                 │
├─────────────────────────┬────────────────────────────┤
│ optimizer_config_lr     │ 0.6231 ███████████████████ │
│ net_config_layers       │ 0.2847 █████████           │
│ scheduler_config_min_lr │ 0.0922 ███                 │
└─────────────────────────┴────────────────────────────┘

Boundary Warnings:
  scheduler_config_min_lr=1.23e-06 at LOWER boundary [1e-07, 1e-03]

┌──────────────────────────────────────────────────┐
│ Top 5 Trials                                     │
├────┬──────────┬────────┬───────────┬─────────────┤
│  # │  Value   │ layers │    lr     │   min_lr    │
├────┼──────────┼────────┼───────────┼─────────────┤
│ 17 │ 0.000312 │   4    │ 3.4210e-2 │ 1.2345e-6   │
│ 23 │ 0.000389 │   4    │ 2.9876e-2 │ 8.9123e-7   │
│  8 │ 0.000401 │   5    │ 4.1234e-2 │ 1.5678e-6   │
│ 31 │ 0.000445 │   3    │ 3.8765e-2 │ 2.3456e-6   │
│ 12 │ 0.000512 │   4    │ 5.6789e-2 │ 9.8765e-7   │
└────┴──────────┴────────┴───────────┴─────────────┘
```
How to read each section:
- Stats: the ratio of pruned to completed trials shows pruner effectiveness. A 40–60% prune rate is healthy. Over 80% may indicate the search space is too wide.
- Best trial: the winning configuration. Use this as the basis for `best.yaml`.
- Parameter importance (fANOVA): which parameters explain the most variance in trial outcomes. If one parameter dominates (>0.7), concentrate your next search around it. If all parameters are roughly equal, the search space is well-balanced.
- Boundary warnings: if the best value is within 5% of a search bound, the optimum may lie outside your current range. Widen the bound in that direction and re-run HPO.
- Top-K table: compare the top trials. If they cluster tightly in one parameter (e.g., all top trials have `layers=4`), that parameter is well-determined. Use it as a fixed value in future runs.
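The boundary check in particular can be sketched as a simple relative-position test. `near_boundary` is a hypothetical helper; hpo-report's actual rule (including how log-scale parameters are handled) may differ.

```python
# Sketch: flag a best value sitting within 5% of either search bound,
# measured as relative position inside the [lo, hi] interval.
def near_boundary(value, lo, hi, tol=0.05):
    pos = (value - lo) / (hi - lo)
    if pos <= tol:
        return "LOWER"
    if pos >= 1 - tol:
        return "UPPER"
    return None

print(near_boundary(4, 3, 6))  # interior value: no warning
print(near_boundary(3, 3, 6))  # sits on the lower bound: LOWER
```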
From HPO to best.yaml
Once HPO completes and you have analyzed the results, create best.yaml for the final multi-seed run.
Start from your HPO run config and make the following changes:
```yaml
# best.yaml — final training run after HPO
project: MyProject_Best
device: cuda:0

net: model.MLP
optimizer: pytorch_optimizer.SPlus
scheduler: pytorch_scheduler.ExpHyperbolicLRScheduler
criterion: torch.nn.MSELoss
criterion_config: {}
data: util.load_data

# 1. Increase epochs from 10 to full training duration
epochs: 150

# 2. Expand seeds for statistical robustness
seeds: [58, 89, 231, 928, 814]

batch_size: 256

net_config:
  nodes: 64   # unchanged — fixed capacity
  layers: 4   # from HPO best trial

optimizer_config:
  lr: 3.421e-2  # from HPO best trial
  eps: 1.e-10

scheduler_config:
  # 3. Set total_steps to match full epochs
  total_steps: 150
  upper_bound: 250  # unchanged — fixed geometry
  min_lr: 1.234e-6  # from HPO best trial

# 4. Enable early stopping and checkpointing for the full run
early_stopping_config:
  enabled: true
  patience: 20
  mode: min
  min_delta: 0.0001

checkpoint_config:
  enabled: true
  save_every_n_epochs: 10
  keep_last_k: 3
  save_best: true
  monitor: val_loss
  mode: min
```
Then run:
```sh
python cli.py train best.yaml
```
The five seeds will each train independently and log to W&B under the same group name, giving you a distribution of final validation losses to report mean ± std.
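Aggregating the per-seed results for reporting can be as simple as the following (the loss values below are made up for illustration):

```python
import statistics

# Final validation losses from the five seeds (illustrative numbers only).
final_val_losses = [3.1e-4, 3.4e-4, 2.9e-4, 3.6e-4, 3.2e-4]

mean = statistics.mean(final_val_losses)
std = statistics.stdev(final_val_losses)  # sample std across seeds
print(f"val_loss = {mean:.2e} ± {std:.2e}")
```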
If the boundary warning showed that min_lr was at the lower edge of its search range, widen the range to [1e-8, 1e-3] and run HPO again before creating best.yaml.