Developer Guide
Repository Structure
legend-dataflow/
├── workflow/
│   ├── Snakefile                 # Main workflow entry point
│   ├── Snakefile-build-raw       # Separate workflow for raw data building
│   ├── rules/                    # Snakemake rule files (one per processing stage)
│   ├── profiles/                 # Execution profiles (default, lngs, sator, ...)
│   └── src/legenddataflow/
│       ├── methods/              # Core library: file keys, patterns, calibration grouping
│       └── scripts/              # Executable scripts called by rules
│           ├── flow/             # Workflow management scripts
│           ├── tier/             # Data tier building scripts
│           └── par/              # Parameter generation scripts
│               ├── geds/         # HPGe detector parameters
│               └── spms/         # SiPM detector parameters
├── docs/                         # Sphinx documentation
├── tests/                        # Test suite
├── dataflow-config.yaml          # Default configuration
└── pyproject.toml                # Package metadata and dependencies
Snakemake Rules
The workflow is built around Snakemake rules that define how output files are derived
from input files. Rules are organised into one file per processing stage under
workflow/rules/.
Each rule specifies:
input – files or Python callables that resolve input paths from wildcards
output – the file(s) the rule produces
params – additional parameters passed to the script
log – where to write execution logs
threads – number of CPU threads to request
script – the Python script (or shell command) to execute
Full details are in the Snakemake rules documentation.
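As an illustration of the input item, a callable receives the rule's wildcards and returns the path to use. The sketch below is plain Python rather than a rule file. The function name, the tier wildcard, and the directory layout are assumptions made for the example, not the repository's actual patterns.

```python
from types import SimpleNamespace


def par_file_for(wildcards):
    """Pick the (hypothetical) parameter file for a tier-building job.

    A callable is useful here because the path is not a plain copy of the
    output wildcards: the datatype is fixed to "cal" regardless of the job.
    """
    return (
        f"par/{wildcards.tier}/cal/{wildcards.period}/{wildcards.run}/"
        f"{wildcards.experiment}-{wildcards.period}-{wildcards.run}"
        f"-cal-par_{wildcards.tier}.yaml"
    )


# Snakemake passes a Wildcards object; SimpleNamespace stands in for it here.
job = SimpleNamespace(
    experiment="l200", period="p03", run="r000", datatype="phy", tier="hit"
)
print(par_file_for(job))
```

In a rule file such a function would be referenced by name in the input section, and Snakemake would call it once per job.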
Rule naming conventions
build_tier_{tier} – rules that build a data tier from input data
build_{tier}_pars_{detector} – rules that derive calibration parameters for a detector type (geds for HPGe, spms for SiPM)
Rules operating on partition-level data use the psp/pht/pan/pet tier names
Parameter generation pattern
Most tiers follow the same two-step pattern:
Parameter generation rules – run on calibration data (cal datatype) to derive per-channel calibration parameters. These produce par_{tier}.yaml files.
Tier building rules – apply those parameters to both calibration and physics data to produce tier files.
For most tiers there are two versions:
Run-level – uses a single calibration run
Partition-level – groups multiple runs together (defined in cal_groupings.yaml) for improved statistical precision
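The difference between the two versions can be sketched with a small lookup. The mapping below only imitates the idea of cal_groupings.yaml: its nesting, the partition names, and the fallback for ungrouped runs are hypothetical, and the real file format may differ.

```python
# Hypothetical stand-in for the contents of cal_groupings.yaml:
# partition name -> list of (period, run) pairs that are calibrated together.
CAL_GROUPINGS = {
    "part001": [("p03", "r000"), ("p03", "r001"), ("p03", "r002")],
    "part002": [("p03", "r003"), ("p04", "r000")],
}


def runs_for_calibration(period, run, partition_level):
    """Return the calibration runs used for data taken in (period, run)."""
    if not partition_level:
        # Run-level: only the single calibration run itself.
        return [(period, run)]
    for runs in CAL_GROUPINGS.values():
        if (period, run) in runs:
            # Partition-level: every run grouped with this one.
            return list(runs)
    # Assumed fallback: a run listed in no partition is used on its own.
    return [(period, run)]


print(runs_for_calibration("p03", "r001", partition_level=False))
print(runs_for_calibration("p03", "r001", partition_level=True))
```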
Scripts
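Scripts live under workflow/src/legenddataflow/scripts/ and are called by rules
via Snakemake’s shell: directive. They receive inputs, outputs, and parameters
from Snakemake as command-line arguments formatted into the shell command; the
snakemake object is only available to scripts run through the script: directive.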
Script categories:
flow/ – workflow management: file discovery, channel lists, run finalisation, file merging
tier/ – data tier building: raw conversion, TCM, event reconstruction, skim
par/geds/ – HPGe parameter generation scripts organised by tier (raw/, tcm/, psp/, pht/)
par/spms/ – SiPM parameter generation scripts
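A script invoked from a shell: command reads its arguments from the command line. The skeleton below is a minimal sketch of that shape using argparse. The flag names and function names are hypothetical and do not reflect any particular script in the repository.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) options a rule's shell command would pass."""
    parser = argparse.ArgumentParser(description="Example parameter script")
    parser.add_argument("--input", nargs="+", required=True, help="input files")
    parser.add_argument("--output", required=True, help="output parameter file")
    parser.add_argument("--log", default=None, help="log file path")
    parser.add_argument("--datatype", default="cal", help="datatype wildcard")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Real processing would go here; this sketch only reports what it was given.
    print(f"{len(args.input)} input file(s) -> {args.output}")
    return args


# Called with an explicit argument list so the sketch runs without a shell.
main(["--input", "a.lh5", "b.lh5", "--output", "par_hit_mystep.yaml"])
```

Passing argv explicitly keeps the parsing logic testable without going through Snakemake.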
Methods Library
The workflow/src/legenddataflow/methods/ package provides shared utilities:
FileKey (FileKey.py) – parses and generates file keys and patterns from wildcard components (experiment, period, run, datatype, timestamp)
patterns (patterns.py) – defines file path patterns for each tier and processing stage
paths (paths.py) – resolves configured paths from the config file
CalGrouping (cal_grouping.py) – loads and queries partition-level calibration groupings from cal_groupings.yaml
ParsKeyResolve (create_pars_keylist.py) – resolves which parameter files apply to a given file key, using the parameter validity catalog
ParsCatalog (pars_loading.py) – loads and manages parameter catalog files
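To make the idea of a file key concrete, here is a minimal sketch that parses and regenerates a key from the five wildcard components listed for FileKey. It is not the FileKey class itself: the hyphen-separated layout, the class name, and its methods are assumptions for illustration.

```python
from typing import NamedTuple


class SimpleFileKey(NamedTuple):
    """Illustrative stand-in for a file key (not the library's FileKey)."""

    experiment: str
    period: str
    run: str
    datatype: str
    timestamp: str

    @classmethod
    def parse(cls, name):
        # Assumes the five components are joined by single hyphens.
        parts = name.split("-")
        if len(parts) != 5:
            raise ValueError(f"expected 5 components, got {len(parts)}: {name!r}")
        return cls(*parts)

    @property
    def name(self):
        # Regenerate the key string from its components.
        return "-".join(self)


key = SimpleFileKey.parse("l200-p03-r000-cal-20230101T000000Z")
print(key.period, key.datatype)
print(key.name)
```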
Adding a new processor to the dsp
If the processor is already in dspeed, this is simple: add the processor to the
relevant config in legend-dataflow-config under tier/dsp. If the processor
isn’t yet in dspeed, either open a PR to add it there, or run with a local version
of dspeed, which can be specified in pyproject.toml.
Adding a new calibration script
Add the rule – add it to the relevant file, e.g. workflow/rules/hit_pars_geds.smk, with:
  A parameter generation rule (if applicable) that reads calibration data and writes e.g. par_hit_mystep.yaml
Write the script(s) – add them under scripts/par/geds/hit/, implementing the processing logic.
Adding a new processing stage
To add a new tier foo:
Add path configuration – add tier_foo and par_foo entries to dataflow-config.yaml and the paths.py helper.
Add file patterns – add pattern definitions for the new tier in methods/patterns.py.
Write the rule file – create workflow/rules/foo.smk with:
  A parameter generation rule (if applicable) that reads calibration data and writes par_foo.yaml
  A tier building rule that applies parameters and writes tier files
Write the script(s) – add scripts to scripts/tier/foo.py and/or scripts/par/geds/foo/ implementing the processing logic.
Include the rule file – add include: "rules/foo.smk" to the main workflow/Snakefile in the appropriate order (after its dependencies).
Update table_format – add the HDF5 table naming pattern for the new tier to the table_format section of the config.
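The file-pattern step for the new tier can be sketched as follows. The function name, directory layout, base path, and .lh5 suffix are assumptions, not the contents of methods/patterns.py. The point is only that a pattern is a string with wildcard placeholders that can be filled in for a concrete file.

```python
import os


def get_pattern_tier_foo(tier_foo_path):
    """Return a (hypothetical) path pattern for files of the new tier foo."""
    return os.path.join(
        tier_foo_path,
        "{datatype}",
        "{period}",
        "{run}",
        "{experiment}-{period}-{run}-{datatype}-{timestamp}-tier_foo.lh5",
    )


# The base path is assumed to come from the tier_foo entry in the config.
pattern = get_pattern_tier_foo("/data/tier/foo")
filled = pattern.format(
    experiment="l200", period="p03", run="r000",
    datatype="phy", timestamp="20230101T000000Z",
)
print(filled)
```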
Adding a new execution environment
To add a new host or container environment:
Add a new key under execenv in dataflow-config.yaml with the container command, image path, and required environment variables.
Add a matching profile directory under workflow/profiles/<hostname>/ with a config.yaml specifying Snakemake options appropriate for that host.
Testing
The test suite uses pytest and is located in tests/.
Run tests with:
pytest tests/
Code style follows PEP 8. The project uses ruff for linting and formatting, which
is enforced in CI. Run locally with:
ruff check .
ruff format .