Developer Guide

Repository Structure

legend-dataflow/
├── workflow/
│   ├── Snakefile                  # Main workflow entry point
│   ├── Snakefile-build-raw        # Separate workflow for raw data building
│   ├── rules/                     # Snakemake rule files (one per processing stage)
│   ├── profiles/                  # Execution profiles (default, lngs, sator, ...)
│   └── src/legenddataflow/
│       ├── methods/               # Core library: file keys, patterns, calibration grouping
│       └── scripts/               # Executable scripts called by rules
│           ├── flow/              # Workflow management scripts
│           ├── tier/              # Data tier building scripts
│           └── par/               # Parameter generation scripts
│               ├── geds/          # HPGe detector parameters
│               └── spms/          # SiPM detector parameters
├── docs/                          # Sphinx documentation
├── tests/                         # Test suite
├── dataflow-config.yaml           # Default configuration
└── pyproject.toml                 # Package metadata and dependencies

Snakemake Rules

The workflow is built around Snakemake rules that define how output files are derived from input files. Rules are organised into one file per processing stage under workflow/rules/.

Each rule specifies:

input – files or Python callables that resolve input paths from wildcards
output – the file(s) the rule produces
params – additional parameters passed to the script
log – where to write execution logs
threads – number of CPU threads to request
script – the Python script (or shell command) to execute

Full details are in the Snakemake rules documentation.
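To make the input entry more concrete, here is a minimal sketch of a Python callable that resolves an input path from wildcards. The function name, the wildcard names, and the path layout are assumptions chosen for illustration and are not taken from the repository; in a real rule, Snakemake supplies the wildcards object itself.

```python
from types import SimpleNamespace


def par_file_for(wildcards):
    """Return the (hypothetical) parameter file needed for the matched wildcards."""
    key = f"{wildcards.experiment}-{wildcards.period}-{wildcards.run}-cal"
    return f"par/{wildcards.tier}/{key}-par_{wildcards.tier}.yaml"


# Stand-in for the wildcards object that Snakemake passes to input functions.
wc = SimpleNamespace(experiment="l200", period="p03", run="r001", tier="dsp")
example_path = par_file_for(wc)
print(example_path)
```

Inside a rule, a function like this would be assigned to an input entry by name instead of being called directly, so that Snakemake can evaluate it once the wildcards are known.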

Rule naming conventions

build_tier_{tier} – rules that build a data tier from input data
build_{tier}_pars_{detector} – rules that derive calibration parameters for a detector type (geds for HPGe, spms for SiPM)
Rules operating on partition-level data use the psp / pht / pan / pet tier names
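The two name forms above can be checked mechanically. The sketch below is only an illustration of the convention: the regular expressions and the classify_rule helper are hypothetical and do not exist in the repository.

```python
import re

# Patterns mirroring the two naming forms listed above.
TIER_RULE = re.compile(r"^build_tier_(?P<tier>[a-z]+)$")
PARS_RULE = re.compile(r"^build_(?P<tier>[a-z]+)_pars_(?P<detector>geds|spms)$")


def classify_rule(name):
    """Return (kind, tier, detector) for a rule name, or None if it fits neither form."""
    match = TIER_RULE.match(name)
    if match:
        return ("tier", match.group("tier"), None)
    match = PARS_RULE.match(name)
    if match:
        return ("pars", match.group("tier"), match.group("detector"))
    return None


print(classify_rule("build_tier_dsp"))
print(classify_rule("build_pht_pars_geds"))
```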

Parameter generation pattern

Most tiers follow the same two-step pattern:

Parameter generation rules – run on calibration data (cal datatype) to derive per-channel calibration parameters. These produce par_{tier}.yaml files.
Tier building rules – apply those parameters to both calibration and physics data to produce tier files.

For most tiers there are two versions:

Run-level – uses a single calibration run
Partition-level – groups multiple runs together (defined in cal_groupings.yaml) for improved statistical precision
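The difference between the two versions comes down to which calibration runs feed the parameters. The sketch below assumes a simplified grouping structure (partition name, then period, then a list of runs); the actual layout of cal_groupings.yaml is not shown in this guide, so both the dictionary and the calibration_runs helper are hypothetical.

```python
# Hypothetical, simplified stand-in for the content of cal_groupings.yaml:
# partition name -> period -> list of runs calibrated together.
CAL_GROUPINGS = {
    "part001": {"p03": ["r000", "r001", "r002"]},
    "part002": {"p03": ["r003"], "p04": ["r000", "r001"]},
}


def calibration_runs(period, run, groupings=CAL_GROUPINGS, partition_level=True):
    """Return the (period, run) pairs whose calibration data feed the parameters."""
    if partition_level:
        for periods in groupings.values():
            if run in periods.get(period, []):
                # Partition-level: every run in the same partition contributes.
                return [(p, r) for p, runs in periods.items() for r in runs]
    # Run-level, or a run not listed in any partition: use only its own calibration.
    return [(period, run)]


print(calibration_runs("p04", "r001"))
print(calibration_runs("p04", "r001", partition_level=False))
```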

Scripts

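Scripts live under workflow/src/legenddataflow/scripts/ and are called by rules via Snakemake’s shell: directive. With shell:, inputs, outputs, and parameters reach the script as command-line arguments filled in from the {input}, {output}, and {params} placeholders; the snakemake object is available only to scripts run through the script: directive.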

Script categories:

flow/ – workflow management: file discovery, channel lists, run finalisation, file merging
tier/ – data tier building: raw conversion, TCM, event reconstruction, skim
par/geds/ – HPGe parameter generation scripts organised by tier (raw/, tcm/, psp/, pht/)
par/spms/ – SiPM parameter generation scripts
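One way a script called from a shell command can receive its paths is as command-line options. The sketch below assumes hypothetical option names (--input, --output, --log) and a placeholder main function; it shows the general shape only, not the interface of any existing script in the repository.

```python
import argparse


def parse_args(argv=None):
    """Parse the options a rule's shell command would pass to this script."""
    parser = argparse.ArgumentParser(description="Hypothetical tier-building script")
    parser.add_argument("--input", required=True, help="input file path")
    parser.add_argument("--output", required=True, help="output file path")
    parser.add_argument("--log", default=None, help="optional log file path")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Real processing would go here; the sketch only reports what it received.
    return f"{args.input} -> {args.output}"


# Example invocation with an explicit argument list instead of sys.argv.
summary = main(["--input", "in.lh5", "--output", "out.lh5"])
print(summary)
```

Keeping the argument parsing in its own function makes the script easy to exercise from tests without going through Snakemake.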

Methods Library

The workflow/src/legenddataflow/methods/ package provides shared utilities:

FileKey (FileKey.py) – parses and generates file keys and patterns from wildcard components (experiment, period, run, datatype, timestamp)
patterns (patterns.py) – defines file path patterns for each tier and processing stage
paths (paths.py) – resolves configured paths from the config file
CalGrouping (cal_grouping.py) – loads and queries partition-level calibration groupings from cal_groupings.yaml
ParsKeyResolve (create_pars_keylist.py) – resolves which parameter files apply to a given file key, using the parameter validity catalog
ParsCatalog (pars_loading.py) – loads and manages parameter catalog files
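The two central ideas here, splitting a file key into its components and picking the parameter entry valid at a given time, can be sketched in a few lines. This is not the FileKey or ParsKeyResolve API: the Key tuple, the dash-separated key format, the timestamp format, and the catalog structure are all assumptions made for illustration.

```python
import re
from bisect import bisect_right
from typing import NamedTuple


class Key(NamedTuple):
    experiment: str
    period: str
    run: str
    datatype: str
    timestamp: str


# Assumed key format: five dash-separated components.
KEY_RE = re.compile(
    r"^(?P<experiment>[^-]+)-(?P<period>[^-]+)-(?P<run>[^-]+)"
    r"-(?P<datatype>[^-]+)-(?P<timestamp>[^-]+)$"
)


def parse_key(name):
    """Split a file key string into its named components."""
    match = KEY_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid file key: {name!r}")
    return Key(**match.groupdict())


def valid_entry(catalog, timestamp):
    """Return the entry whose validity start is the latest one not after timestamp.

    catalog is a list of (valid_from, par_file) pairs sorted by valid_from;
    timestamps of the form YYYYMMDDTHHMMSSZ sort correctly as plain strings.
    """
    starts = [start for start, _ in catalog]
    index = bisect_right(starts, timestamp) - 1
    if index < 0:
        raise LookupError(f"no parameters valid at {timestamp}")
    return catalog[index][1]


key = parse_key("l200-p03-r001-cal-20230401T000000Z")
catalog = [("20230101T000000Z", "a.yaml"), ("20230401T000000Z", "b.yaml")]
print(key.run, valid_entry(catalog, key.timestamp))
```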

Adding a new processor to the dsp

If the processor is already in dspeed, this is simple: add the processor to the relevant config in legend-dataflow-config under tier/dsp. If the processor isn’t yet in dspeed, either open a PR to add it, or run against a local version of dspeed, which can be specified in pyproject.toml.
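As a rough illustration of what such a config entry can look like, the sketch below writes one processor block as a Python dictionary and prints it as JSON. It assumes the usual dspeed processing-chain layout (an outputs list and a processors mapping with function, module, args, and unit); the field names waveform and baseline and the output name wf_blsub are placeholders, and the exact layout used in legend-dataflow-config should be checked against the existing files there.

```python
import json

# Assumed layout of a dspeed processing-chain config; names are placeholders.
dsp_config = {
    "outputs": ["wf_blsub"],
    "processors": {
        # Key: the output variable(s) this processor produces.
        "wf_blsub": {
            "function": "bl_subtract",
            "module": "dspeed.processors",
            # Inputs first, output last.
            "args": ["waveform", "baseline", "wf_blsub"],
            "unit": "ADC",
        },
    },
}

print(json.dumps(dsp_config, indent=2))
```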

Adding a new calibration script

Add the rule – add a parameter generation rule to the relevant rule file, e.g. workflow/rules/hit_pars_geds.smk, that reads calibration data and writes its output, e.g. par_hit_mystep.yaml.
Write the script(s) – add them under the matching directory, e.g. scripts/par/geds/hit/, implementing the processing logic.

Adding a new processing stage

To add a new tier foo:

Add path configuration – add tier_foo and par_foo entries to dataflow-config.yaml and the paths.py helper.
Add file patterns – add pattern definitions for the new tier in methods/patterns.py.
Write the rule file – create workflow/rules/foo.smk with:
- A parameter generation rule (if applicable) that reads calibration data and writes par_foo.yaml
- A tier building rule that applies parameters and writes tier files
Write the script(s) – add scripts to scripts/tier/foo.py and/or scripts/par/geds/foo/ implementing the processing logic.
Include the rule file – add include: "rules/foo.smk" to the main workflow/Snakefile in the appropriate order (after its dependencies).
Update table_format – add the HDF5 table naming pattern for the new tier to the table_format section of the config.
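For the file-pattern step, a pattern is essentially a path template whose wildcards Snakemake fills in. The sketch below shows the idea with plain string formatting; the function names, directory layout, key pattern, and the .lh5 extension are assumptions for illustration and do not reproduce the contents of methods/patterns.py.

```python
# Assumed file key layout shared by all patterns.
KEY_PATTERN = "{experiment}-{period}-{run}-{datatype}-{timestamp}"


def tier_foo_pattern(tier_dir="tier/foo"):
    """Hypothetical path pattern for tier foo files, with Snakemake-style wildcards."""
    return f"{tier_dir}/{{datatype}}/{{period}}/{{run}}/{KEY_PATTERN}-tier_foo.lh5"


def par_foo_pattern(par_dir="par/foo"):
    """Hypothetical path pattern for the par_foo.yaml files of calibration runs."""
    return f"{par_dir}/cal/{{period}}/{{run}}/{KEY_PATTERN}-par_foo.yaml"


# Filling in the wildcards by hand shows the concrete path a rule would produce.
pattern = tier_foo_pattern()
concrete = pattern.format(
    experiment="l200",
    period="p03",
    run="r001",
    datatype="phy",
    timestamp="20230401T000000Z",
)
print(pattern)
print(concrete)
```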

Adding a new execution environment

To add a new host or container environment:

Add a new key under execenv in dataflow-config.yaml with the container command, image path, and required environment variables.
Add a matching profile directory under workflow/profiles/<hostname>/ with a config.yaml specifying Snakemake options appropriate for that host.
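To show how the three pieces of an execenv entry fit together, the sketch below models one entry as a Python dictionary and builds the command prefix that would wrap each job. The key names (cmd, arg, env), the host name, the image path, and the command_prefix helper are hypothetical; the real key names should be copied from an existing entry in dataflow-config.yaml.

```python
import shlex

# Hypothetical execenv entry: container command, image path, environment variables.
EXECENV = {
    "myhost": {
        "cmd": "apptainer exec",
        "arg": "/path/to/image.sif",
        "env": {"EXAMPLE_VAR": "value"},
    }
}


def command_prefix(host, execenv=EXECENV):
    """Build the argument list that precedes the actual job command on a host."""
    entry = execenv[host]
    env_flags = []
    for name, value in entry.get("env", {}).items():
        # Pass each variable into the container explicitly.
        env_flags += ["--env", f"{name}={value}"]
    return shlex.split(entry["cmd"]) + env_flags + [entry["arg"]]


print(command_prefix("myhost"))
```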

Testing

The test suite uses pytest and is located in tests/. Run tests with:

pytest tests/
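A new test is just a function whose name starts with test_ in a file under tests/ that pytest can discover. The sketch below is a minimal example of that shape; the rule_name helper is hypothetical and defined inline only so the example is self-contained.

```python
def rule_name(tier):
    """Hypothetical helper under test: build the tier-building rule name."""
    return f"build_tier_{tier}"


def test_rule_name_follows_convention():
    # pytest collects this function by its test_ prefix and runs the assertions.
    assert rule_name("dsp") == "build_tier_dsp"
    assert rule_name("hit") == "build_tier_hit"
```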

Code style follows PEP 8. The project uses ruff for linting and formatting, which is enforced in CI. Run locally with:

ruff check .
ruff format .