Data-Driven Estimators

lrdbench includes experimental supervised baselines for comparing classical long-range dependence (LRD) estimators with data-driven approaches under the same benchmark protocol.

Built-in data-driven estimators:

MLRandomForest: scikit-learn random forest regressor.
MLSVR: scikit-learn support vector regressor.
MLCNN: PyTorch 1D convolutional neural-network regressor.
MLLSTM: PyTorch LSTM regressor.

All four currently target hurst_scaling_proxy. They are baselines, not reference-grade LRD estimators, so interpret their results only relative to the training distribution declared in the manifest.

Installation

Feature-based ML baselines:

pip install "lrdbench[ml,reports]"

Neural-network baselines:

pip install "lrdbench[nn,reports]"

From a source checkout:

pip install -e ".[ml,reports]"

The packaged smoke_data_driven suite uses MLRandomForest and MLSVR, so it does not require PyTorch.
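Before picking estimators for a manifest, it can help to check which optional backends are importable in the current environment. The sketch below is not part of lrdbench: the REQUIRED_MODULE mapping and the available_estimators helper are hypothetical names, and the mapping only encodes the dependencies listed above (scikit-learn imports as sklearn, PyTorch as torch).

```python
from importlib.util import find_spec

# Hypothetical mapping from estimator name to the importable module it needs,
# based on the estimator list above. Not an lrdbench API.
REQUIRED_MODULE = {
    "MLRandomForest": "sklearn",
    "MLSVR": "sklearn",
    "MLCNN": "torch",
    "MLLSTM": "torch",
}


def available_estimators():
    """Map each estimator name to True if its backend module can be found."""
    return {
        name: find_spec(module) is not None
        for name, module in REQUIRED_MODULE.items()
    }


for name, ok in available_estimators().items():
    print(f"{name}: {'available' if ok else 'backend not installed'}")
```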

Programmatic example

The repository includes a complete script:

python examples/data_driven_baseline_benchmark.py

It builds a small stress-test manifest in memory, trains MLRandomForest and MLSVR, benchmarks them against RS, and prints the report and model artefact paths.

Run the smoke suite

From an installed package:

lrdbench validate smoke_data_driven
lrdbench run smoke_data_driven

From a source checkout:

lrdbench validate configs/suites/smoke_data_driven.yaml
lrdbench run configs/suites/smoke_data_driven.yaml

The run writes trained model artefacts under:

reports/<run_id>/ml_models/

The raw artefact table records the model files and training summary:

reports/<run_id>/raw/artefacts.csv
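To inspect the artefact table without opening it by hand, a minimal sketch using only the standard library follows. The column names of artefacts.csv are not documented here, so the code makes no assumption about them and simply returns each row as a dictionary; read_artefact_rows is a hypothetical helper, not an lrdbench function.

```python
import csv
from pathlib import Path


def read_artefact_rows(run_dir):
    """Return the rows of <run_dir>/raw/artefacts.csv as a list of dicts."""
    table_path = Path(run_dir) / "raw" / "artefacts.csv"
    with table_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Usage (replace the placeholder with a real reports/<run_id> directory):
#     for row in read_artefact_rows("reports/<run_id>"):
#         print(row)
```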

Manifest shape

Data-driven estimators use the same estimators list as classical estimators, plus a top-level ml_training block:

ml_training:
  enabled: true
  target_estimand: hurst_scaling_proxy
  validation_fraction: 0.25
  source:
    type: generator_grid
    generators:
      - family: fGn
        params:
          H: [0.35, 0.5, 0.65, 0.8]
          n: [128]
          sigma: [1.0]
        replicates: 2
  contamination:
    include_clean: true
    operators:
      - name: level_shift
        params:
          shift: [0.2]

estimators:
  - name: RS
    family: temporal
    target_estimand: hurst_scaling_proxy
    supports_ci: false
    supports_diagnostics: true
    params:
      n_bootstrap: 0
  - name: MLRandomForest
    family: data_driven
    target_estimand: hurst_scaling_proxy
    assumptions: [trained_on_manifest_synthetic_distribution]
    supports_ci: false
    supports_diagnostics: true
    params:
      n_estimators: 100
      random_state: 7
      max_lag: 16
  - name: MLLSTM
    family: data_driven
    target_estimand: hurst_scaling_proxy
    assumptions: [trained_on_manifest_synthetic_distribution]
    supports_ci: false
    supports_diagnostics: true
    params:
      sequence_length: 256
      hidden_size: 32
      num_layers: 1
      dropout: 0.2
      learning_rate: 0.001
      weight_decay: 0.0001
      batch_size: 16
      epochs: 10
      random_state: 7

ml_training.source is independent of the benchmark evaluation source. The runner trains models once per run, writes model artefacts, injects the trained model paths into estimator metadata, and then executes the normal (record, estimator) fit loop.
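To get a feel for how small the training set declared in the manifest above is, the following rough sketch counts series under three assumptions that this page does not confirm: generator_grid takes the Cartesian product of the list-valued params, each replicate yields one series per grid point, and every contamination variant (the clean copy plus each operator setting) is applied to every series. The helper names are hypothetical.

```python
from itertools import product


def expand_param_grid(params):
    """Cartesian product of list-valued parameters, as a list of dicts."""
    names = sorted(params)
    return [
        dict(zip(names, combo))
        for combo in product(*(params[name] for name in names))
    ]


def estimate_training_size(params, replicates, n_operator_settings, include_clean):
    """Grid points x replicates x contamination variants (assumed semantics)."""
    variants = n_operator_settings + (1 if include_clean else 0)
    return len(expand_param_grid(params)) * replicates * variants


# Values copied from the fGn generator entry in the manifest above.
fgn_params = {"H": [0.35, 0.5, 0.65, 0.8], "n": [128], "sigma": [1.0]}
grid = expand_param_grid(fgn_params)
total = estimate_training_size(
    fgn_params, replicates=2, n_operator_settings=1, include_clean=True
)
print(len(grid), total)  # prints: 4 16
```

Under these assumptions the example manifest yields only 16 training series, which is enough for a smoke test but far too few to judge model quality.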

Pretrained artefacts

For frozen model comparisons, provide a model path directly in estimator params and omit ml_training:

estimators:
  - name: MLRandomForest
    family: data_driven
    target_estimand: hurst_scaling_proxy
    supports_ci: false
    supports_diagnostics: true
    params:
      model_path: reports/example/ml_models/MLRandomForest.pkl

Use only trusted model artefacts. The scikit-learn baselines use pickle-compatible artefacts, and loading a pickle file can execute arbitrary code.
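One way to reduce the risk of loading a tampered artefact is to compare its SHA-256 digest with a value recorded when the model was trained, before passing the path to a manifest. This is a sketch of that general practice, not an lrdbench feature; verify_artefact_checksum is a hypothetical helper, and the expected digest must come from a source you trust.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artefact_checksum(path, expected_sha256):
    """Return the path if its digest matches, otherwise raise ValueError."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return Path(path)
```

A matching checksum only shows that the file is the one you recorded. It does not make an untrusted pickle safe to load.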

Interpretation

Data-driven baselines can be useful for testing whether supervised models learn robustness patterns that classical estimators miss under declared contaminations. They also introduce new risks:

training distribution shift;
leakage if training and evaluation grids are not kept distinct;
overfitting on small training grids;
hyperparameter sensitivity;
no estimator-level confidence intervals in the initial implementation.
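The leakage risk in the list above can be screened for by comparing the parameter combinations of the training grid with those of the evaluation grid. The sketch below assumes both grids expand as Cartesian products of their list-valued params, which this page does not confirm. The evaluation values are invented for illustration, and grid_points and overlapping_points are hypothetical helpers.

```python
from itertools import product


def grid_points(family, params):
    """Set of hashable (family, (name, value), ...) tuples for one generator."""
    names = sorted(params)
    return {
        (family,) + tuple(zip(names, combo))
        for combo in product(*(params[name] for name in names))
    }


def overlapping_points(train_specs, eval_specs):
    """Sorted parameter combinations present in both training and evaluation."""
    train = set().union(*(grid_points(f, p) for f, p in train_specs))
    evaluation = set().union(*(grid_points(f, p) for f, p in eval_specs))
    return sorted(train & evaluation)


# Training grid copied from the example manifest; evaluation grid is made up.
train_specs = [("fGn", {"H": [0.35, 0.5, 0.65, 0.8], "n": [128], "sigma": [1.0]})]
eval_specs = [("fGn", {"H": [0.5, 0.7], "n": [128], "sigma": [1.0]})]

overlap = overlapping_points(train_specs, eval_specs)
for point in overlap:
    print("shared grid point:", point)
```

Here the combination with H = 0.5 appears in both grids, so results at that point say little about generalisation beyond the training distribution.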

When reporting these results, include the benchmark manifest, ml_training block, estimator metadata table, and ml_models/training_summary.json.
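A small pre-publication check can confirm that some of these items exist before results are shared. The sketch below only covers what this page gives a location for: the manifest file you ran, its top-level ml_training block (detected with a plain text scan rather than a YAML parser), and ml_models/training_summary.json under the run directory. The estimator metadata table is left out because its path is not stated here, and missing_report_items is a hypothetical helper.

```python
from pathlib import Path


def missing_report_items(run_dir, manifest_path):
    """Return descriptions of reporting items that could not be found."""
    problems = []
    manifest = Path(manifest_path)
    if not manifest.is_file():
        problems.append(f"manifest not found: {manifest}")
    else:
        lines = manifest.read_text(encoding="utf-8").splitlines()
        # A top-level YAML key starts in column 0.
        if not any(line.startswith("ml_training:") for line in lines):
            problems.append("manifest has no top-level ml_training block")
    summary = Path(run_dir) / "ml_models" / "training_summary.json"
    if not summary.is_file():
        problems.append(f"training summary not found: {summary}")
    return problems
```

Runs that use pretrained artefacts omit ml_training by design, so the missing-block message is expected in that case.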