Benchmark protocol¶
- Manifest (YAML): declares mode, source, optional contamination, optional ml_training, estimators, metrics, leaderboards, report, and seeds.
- Records: synthetic grid (generator_grid), stress pairs (clean + contaminated), or observational series (csv_series_index/inline_table).
- Optional data-driven training: built-in ML/NN estimators train from ml_training and write run-local model artefacts before benchmark estimation.
- Estimation: each (record, estimator_spec) yields an EstimateResult.
- Evaluation: mode-appropriate metrics (MetricBundle) and optional leaderboards.
- Outputs: CSV result store under reports/<run_id>/ plus HTML/CSV summaries from the reporter.
Example suite manifests: configs/suites/smoke_*.yaml.
Previewing a benchmark grid¶
Before running an expensive manifest, preview the materialised record-estimator grid:
lrdbench run configs/suites/public_medium_stress_contamination.yaml --dry-run
--dry-run loads and validates the manifest, materialises synthetic or observational records, and
prints the benchmark mode, record count, estimator count, total fit jobs, clean/contaminated split
for stress tests, and global seed. It does not fit estimators, train data-driven models, write
reports, or populate caches. The same capability is available programmatically through
BenchmarkRunner.preview().
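The job count reported by the preview can be reasoned about with a small standalone sketch. It assumes that total fit jobs equal the full record × estimator grid; the function name preview_grid and the record ids are hypothetical and are not part of the lrdbench API.

```python
from itertools import product


def preview_grid(record_ids, estimator_names):
    """Count fit jobs for a full record x estimator grid (illustrative only)."""
    jobs = list(product(record_ids, estimator_names))
    return {
        "record_count": len(record_ids),
        "estimator_count": len(estimator_names),
        "total_fit_jobs": len(jobs),
    }


# Hypothetical record ids; estimator names taken from the bundled registry.
summary = preview_grid(["rec_000", "rec_001", "rec_002"], ["RS", "DFA"])
print(summary)
```

Under this assumption, three records and two estimators give six fit jobs; declared parameter variants would presumably count as additional estimators.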
A machine-readable JSON Schema for the manifest format is available at
configs/contracts/manifest_schema.json.
Bundled estimator names and target estimands are documented in
Bundled estimators. The default registry includes temporal aggregation
methods (AbsoluteMoment, Variance, and VarianceResidual) alongside RS, DFA, DMA,
spectral, geometric, wavelet, and data-driven estimators.
Data-driven baseline suites¶
configs/suites/smoke_data_driven.yaml is the tutorial-scale suite for run-local supervised
baselines. It trains MLRandomForest and MLSVR from a manifest-declared fGn training grid, then
benchmarks them alongside RS in stress-test mode:
pip install -e ".[ml,reports]"
lrdbench validate configs/suites/smoke_data_driven.yaml
lrdbench run configs/suites/smoke_data_driven.yaml
The trained models and training summary are exported under reports/<run_id>/ml_models/ and
indexed in raw/artefacts.csv. For details, see Data-driven estimators.
For interpretation rules covering aggregation, uncertainty, leaderboards, and failures, see Interpretation semantics.
Public small suites¶
The tracked public_small_* manifests are the stable tutorial-scale public benchmark set. They are
larger than CI smoke tests but intended to remain laptop-runnable from a clean clone:
lrdbench validate configs/suites/public_small_canonical_ground_truth.yaml
lrdbench run configs/suites/public_small_canonical_ground_truth.yaml
Installed packages also expose tracked suites by name:
lrdbench list-suites
lrdbench validate public_small_canonical_ground_truth
lrdbench run public_small_canonical_ground_truth
Available suites:
- configs/suites/public_small_canonical_ground_truth.yaml: fGn ground-truth accuracy and disagreement for RS, DFA, and DMA.
- configs/suites/public_small_stress_contamination.yaml: controlled level-shift, outlier, and polynomial-trend stress testing on fGn.
- configs/suites/public_small_null_false_positive.yaml: short-memory boundary checks using fGn H=0.5 and ARFIMA d=0.0.
- configs/suites/public_small_sensitivity_disagreement.yaml: scale/window parameter variants for DFA, DMA, and WaveletOLS.
By default these suites write reports under reports/public_small/<run_id>/.
Expected output shape and local reference run counts are recorded in Public small outputs. The machine-readable report and result-store contract is recorded in Output contract.
Public medium suites¶
The tracked public_medium_* manifests are stable public suites for more serious local
benchmark campaigns. They use broader grids, more replicates, more estimators, or richer
contamination designs than the public-small suites. They are intended for laptop or workstation
runs, not CI smoke checks.
Available suites:
- configs/suites/public_medium_canonical_ground_truth.yaml: fGn canonical accuracy, disagreement, and paired benchmark uncertainty for RS, DFA, DMA, GHE, and WaveletOLS.
- configs/suites/public_medium_stress_contamination.yaml: level-shift, outlier, polynomial-trend, and heavy-tail-noise stress testing on fGn.
- configs/suites/public_medium_null_false_positive.yaml: larger short-memory boundary checks using fGn H=0.5 and ARFIMA d=0.0.
- configs/suites/public_medium_sensitivity_disagreement.yaml: three-variant scale/window sensitivity grids for DFA, DMA, and WaveletOLS.
By default these suites write reports under reports/public_medium/<run_id>/.
Reference output row counts are recorded in
Public medium outputs.
Report completeness¶
The CSV/HTML reporter emits audit-oriented tables in addition to metric summaries:
- tables/estimator_metadata.csv: enrolled estimators, families, target estimands, assumptions, parameters, and version metadata.
- tables/failures.csv: per-estimator/per-stratum missing metric values, missing uncertainty counts, invalid estimate counts, and invalid rates when validity_rate rows are available.
- manifest/environment.json: Python/platform/package metadata plus seed, execution, and uncertainty settings.
- artefacts/artefact_index.csv: one row per exported report artefact.
The raw result store also persists report artefact metadata to raw/artefacts.csv.
The HTML report renders compact sections for failure summaries, estimator disagreement,
scale/window sensitivity, benchmark uncertainty, uncertainty calibration, and audit artefact links.
When latex is requested, the reporter also writes publication-oriented tables for disagreement,
sensitivity, benchmark uncertainty, and failures under latex/.
Optional report.figure_set entries:
- degradation_curve: stress-test degradation bar plot by contamination operator.
- disagreement_heatmap: aggregate estimator-disagreement heatmap.
- sensitivity_heatmap: aggregate scale/window-sensitivity heatmap.
- benchmark_uncertainty_intervals: point estimates with bootstrap interval error bars.
- false_positive_lrd: balanced-global false-positive LRD rate bar plot.
Figure generation is part of the core reporting contract and uses the standard plotting stack
(matplotlib and seaborn). Requested figures are omitted only when the relevant data is absent;
if suitable data exists but plotting cannot be imported, the reporter raises a clear runtime error.
Estimator disagreement¶
Truth-free disagreement metrics are admissible in all benchmark modes when multiple estimators are declared:
- cross_estimator_dispersion: population standard deviation of valid point estimates within each record.
- pairwise_estimator_disagreement: absolute point-estimate difference for each estimator pair within each record.
- family_level_disagreement: mean absolute disagreement within estimator families and between estimator-family pairs.
These metrics preserve record-level estimator pairing and aggregate using the same stratum and
balanced-global rules as other metrics. The reporter exports them to
tables/estimator_disagreement.csv.
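As a minimal sketch of the first two record-level definitions above, the following standalone Python computes them for one record. The function bodies, the use of None for an invalid estimate, and the choice to return None when fewer than two valid estimates remain are assumptions for illustration, not the lrdbench implementation.

```python
from itertools import combinations
from statistics import pstdev


def cross_estimator_dispersion(estimates):
    """Population standard deviation of the valid point estimates of one record."""
    values = [v for v in estimates.values() if v is not None]
    # Assumption: dispersion is undefined with fewer than two valid estimates.
    return pstdev(values) if len(values) >= 2 else None


def pairwise_estimator_disagreement(estimates):
    """Absolute point-estimate difference for each pair of valid estimators."""
    valid = {name: v for name, v in estimates.items() if v is not None}
    return {
        (a, b): abs(valid[a] - valid[b])
        for a, b in combinations(sorted(valid), 2)
    }


# Hypothetical point estimates for a single record; DMA failed (None).
record_estimates = {"RS": 0.70, "DFA": 0.60, "DMA": None}
dispersion = cross_estimator_dispersion(record_estimates)
pairs = pairwise_estimator_disagreement(record_estimates)
print(dispersion, pairs)
```

Both functions operate on one record at a time, which is what keeps the estimator pairing intact before any stratum-level aggregation.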
Scale/window sensitivity¶
An estimator entry may declare parameter variants to evaluate sensitivity to plausible scale, window, or tuning choices:
estimators:
  - name: DFA
    family: temporal
    target_estimand: hurst_scaling_proxy
    params:
      n_bootstrap: 0
    variants:
      - name: short_scales
        params:
          min_scale: 8
          max_scale: 32
      - name: long_scales
        params:
          min_scale: 16
          max_scale: 64
Variants are materialised as estimator names of the form DFA::short_scales while preserving the
base registry estimator DFA for execution. Truth-free sensitivity metrics are admissible in all
benchmark modes:
- parameter_variant_sensitivity: population standard deviation of valid variant point estimates for each base estimator within each record.
- max_variant_drift: maximum absolute point-estimate difference across valid variants for each base estimator within each record.
The reporter exports these rows to tables/scale_window_sensitivity.csv.
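The two sensitivity metrics can be illustrated with a short standalone sketch. It assumes variant names follow the Base::variant form described above and that an invalid estimate is represented by None; the function name variant_sensitivity and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import pstdev


def variant_sensitivity(estimates):
    """Per-base-estimator sensitivity metrics for the variants of one record."""
    groups = defaultdict(list)
    for name, value in estimates.items():
        base, sep, _variant = name.partition("::")
        # Only names of the form Base::variant with a valid value contribute.
        if sep and value is not None:
            groups[base].append(value)
    return {
        base: {
            "parameter_variant_sensitivity": pstdev(vals),
            # The largest absolute pairwise difference equals max minus min.
            "max_variant_drift": max(vals) - min(vals),
        }
        for base, vals in groups.items()
        if len(vals) >= 2  # assumption: at least two valid variants are needed
    }


# Hypothetical point estimates for one record.
record_estimates = {"DFA::short_scales": 0.62, "DFA::long_scales": 0.70, "RS": 0.66}
sensitivity = variant_sensitivity(record_estimates)
print(sensitivity)
```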
Benchmark-level uncertainty¶
Optional YAML block uncertainty:
- enabled (boolean, default false when the block is absent): compute benchmark-level uncertainty rows.
- n_bootstrap (integer >= 1, default 200): bootstrap replicates.
- ci_levels (list of confidence levels, default [0.95]): percentile interval levels.
- seed (integer, optional): bootstrap RNG seed; defaults to seeds.global_seed.
- metrics (list of metric names, optional): restrict aggregate bootstrap intervals to selected metrics.
- paired (boolean, default false): also compute paired bootstrap intervals for estimator differences on records where both estimators have the same per-record metric.
- paired_metrics (list of metric names, optional): restrict paired differences separately.
Aggregate intervals bootstrap records within each stratum and bootstrap stratum summaries for
balanced-global rows. Paired intervals preserve record-level estimator pairing and report mean
differences as EstimatorA__minus__EstimatorB. The reporter exports rows to
tables/benchmark_uncertainty.csv, and the raw result store records them with
scope=uncertainty.
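The following is a minimal sketch of a percentile bootstrap over the records of a single stratum, plus a paired variant that resamples per-record differences so that estimator pairing is preserved. It covers only the within-stratum case; the function names, the linear-interpolation percentile, and the sample values are assumptions for illustration rather than the lrdbench implementation.

```python
import random
from statistics import mean


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted list for q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_mean_ci(values, n_bootstrap=200, level=0.95, seed=0):
    """Percentile interval for the mean, resampling records with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_bootstrap))
    alpha = (1 - level) / 2
    return percentile(means, alpha), percentile(means, 1 - alpha)


def paired_difference_ci(metric_a, metric_b, n_bootstrap=200, level=0.95, seed=0):
    """Mean difference A minus B on shared records, with a paired interval."""
    shared = sorted(set(metric_a) & set(metric_b))
    # Resampling the per-record differences keeps each record's pairing intact.
    diffs = [metric_a[r] - metric_b[r] for r in shared]
    return mean(diffs), bootstrap_mean_ci(diffs, n_bootstrap, level, seed)


# Hypothetical per-record metric values keyed by record id.
errors_a = {"r1": 0.3, "r2": 0.5, "r3": 0.4}
errors_b = {"r1": 0.2, "r2": 0.4, "r3": 0.3, "r4": 0.9}
mean_diff, (ci_low, ci_high) = paired_difference_ci(errors_a, errors_b)
print(mean_diff, ci_low, ci_high)
```

Record r4 is dropped from the paired comparison because only one estimator has a metric for it.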
Bootstrap methodology¶
Estimator-level confidence intervals use the circular block bootstrap (CBB) with a fixed block length. The default block length for each estimator is:
max(4, n // 10)
where n is the record length. This is a pragmatic compromise, not a data-adaptive choice.
It respects local dependence structure (essential for LRD series) while remaining reproducible
from the manifest alone.
You can override the block length per estimator:
estimators:
  - name: DFA
    params:
      n_bootstrap: 200
      bootstrap_block_len: 32
There is currently no automatic block-length selection (e.g. Politis–White). If you need a
data-adaptive length, compute it externally and set bootstrap_block_len explicitly.
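For readers unfamiliar with the circular block bootstrap, a minimal standalone sketch of one resample is given below, using the default block length rule stated above. The function names and the RNG handling are assumptions; this is not the lrdbench code path.

```python
import random


def default_block_len(n):
    """Default block length described above: max(4, n // 10)."""
    return max(4, n // 10)


def circular_block_resample(series, block_len=None, rng=None):
    """Draw one circular-block-bootstrap resample of the same length as series."""
    n = len(series)
    if block_len is None:
        block_len = default_block_len(n)
    if rng is None:
        rng = random.Random(0)
    out = []
    while len(out) < n:
        start = rng.randrange(n)
        # Blocks wrap around the end of the series, hence "circular".
        out.extend(series[(start + k) % n] for k in range(block_len))
    return out[:n]  # trim the final block so the length matches the input


series = list(range(50))
resample = circular_block_resample(series, block_len=5, rng=random.Random(1))
print(resample)
```

Each block copies consecutive observations, so short-range dependence inside a block is retained while the order of blocks is randomised.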
Execution (Phase 5)¶
Optional YAML block execution:
- max_workers (integer ≥ 1, default 1): when greater than 1, estimator fit calls run in parallel with a thread pool (order of results matches the serial (record × estimator) grid).
- estimate_cache_dir (string, optional): directory for pickle caches of EstimateResult keyed by record id, estimator name, parameter schema, and a hash of the series values. Relative paths are resolved against the same working base as observational CSV paths (the manifest directory when using lrdbench run <file.yaml>, otherwise the current working directory / base_dir for programmatic runs).
- cache_read / cache_write (booleans, default true): control cache lookup and population.
Only use estimate caches from trusted locations: the caches are pickle files, and unpickling untrusted data can execute arbitrary code.
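The two execution ideas above, order-preserving parallel fits and a content-based cache key, can be sketched as follows. This is an illustration under assumptions: the helper names cache_key and run_grid are hypothetical, the key hashes the parameter values as JSON (the actual "parameter schema" component may differ), and nothing here reads or writes pickle files.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor


def cache_key(record_id, estimator_name, params, series):
    """Hash the record id, estimator name, parameters, and series values."""
    h = hashlib.sha256()
    parts = [
        record_id,
        estimator_name,
        json.dumps(params, sort_keys=True),  # key order must not change the key
        repr([float(x) for x in series]),
    ]
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so adjacent fields cannot run together
    return h.hexdigest()


def run_grid(records, estimators, fit, max_workers=1):
    """Run fit over the record x estimator grid, preserving serial order."""
    jobs = [(rec, est) for rec in records for est in estimators]
    if max_workers <= 1:
        return [fit(rec, est) for rec, est in jobs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the submitted jobs.
        return list(pool.map(lambda job: fit(*job), jobs))


def toy_fit(record, estimator):
    """Stand-in for an estimator fit call."""
    return f"{estimator}({record})"


serial = run_grid(["r1", "r2"], ["RS", "DFA"], toy_fit)
parallel = run_grid(["r1", "r2"], ["RS", "DFA"], toy_fit, max_workers=4)
print(serial == parallel)
```

Including a hash of the series values in the key means a cached result is not reused when the underlying data changes under the same record id.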