User Manual¶

Installation¶

Prerequisites¶

Python 3.11 or later
uv (or another virtual environment manager)
Access to the LEGEND metadata repository

Clone and install the package:

git clone https://github.com/legend-exp/legend-dataflow.git
cd legend-dataflow
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[dev]"

The [dev] extras include development tools such as testing and linting dependencies. For production use, omit [dev].

Installing the software environment¶

All software versions are specified in the pyproject.toml file. Each entry can be either a released version of a package (e.g. dbetto==1.2.4) or a local checkout (e.g. dbetto @ file:///${PROJECT_ROOT}/dbetto).

After configuring dataflow-config.yaml for your site (see Configuration below), install the execution environment:

dataflow -v install -s <host> dataflow-config.yaml

where <host> is one of the execution environments defined in the config (e.g. bare, lngs, sator, nersc). This installs all required software into .snakemake/legend-dataflow/venv.

Note

If you update the software, clear the numba cache directory (defined in the config under execenv.<host>.env.NUMBA_CACHE_DIR) to avoid stale compiled code.

Configuration¶

Data processing resources are configured via a single site-dependent YAML file, conventionally named dataflow-config.yaml. The default file in the repository root can serve as a starting point.

Key settings¶

Parameter	Description
`legend_metadata_version`	Version of legend-metadata to check out automatically, given either as a tag (e.g. `v1.2.0`) or as a branch (e.g. `main`). Comment this line out if you want to edit the metadata in your cycle.
`allow_none_par`	If `false`, the workflow aborts when calibration parameter generation fails. If `true`, it continues with default or overridden parameters.
`build_file_dbs`	Whether to generate `pygama.flow.FileDB` databases after each successful production run.
`check_log_files`	Whether to scan log files for errors/warnings after each run.
`multiprocess`	Enable parallel processing within a single Snakemake job.
`pht_intier`	Tier to use as input for `pht`: either `dsp` or `psp`.

Paths¶

All path values support the $_ placeholder, which is substituted with the value of the $PRODENV environment variable at runtime. This allows a single config file to be used across different machines by setting PRODENV appropriately.
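The substitution described above can be pictured with a short sketch. This is an illustration only, not the package's actual implementation: the function name expand_prodenv and the example paths are hypothetical, and a plain string replacement is assumed.

```python
def expand_prodenv(value: str, prodenv: str) -> str:
    """Replace every `$_` in a configured path with the PRODENV root."""
    return value.replace("$_", prodenv)


# Illustrative values only; neither path comes from the real configuration.
paths = {"tier_dsp": "$_/generated/tier/dsp", "log": "$_/generated/log"}
resolved = {
    key: expand_prodenv(val, "/data/prodenv/my-cycle")
    for key, val in paths.items()
}
```

Absolute paths without the placeholder pass through unchanged, which is what lets a path point at files from another cycle.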

The key path categories are:

Input paths: metadata, config, par_overwrite, chan_map, detector_status, detector_db
Output tier paths: tier_raw, tier_tcm, tier_dsp, tier_hit, tier_psp, tier_pht, tier_ann, tier_evt, tier_pan, tier_pet, tier_skm
Parameter directories: par_raw, par_tcm, par_dsp, par_hit, par_psp, par_pht, par_pet
Scratch/log directories: tmp_plt, tmp_log, tmp_filelists, tmp_par, log, plt

Each generated path can point either to a location inside the current cycle, where new files will be written, or to files from another cycle, e.g. "tier_dsp": "/data2/public/prodenv/prod-blind/ref/v2.0.0", in which case those existing files are used as input. The metadata version is set in this file under legend_metadata_version, either as a tag (e.g. v1.2.0) or as a branch (e.g. main). Comment this line out if you want to edit the metadata in your cycle.

Execution environments¶

The execenv section defines how scripts are executed. Each named environment specifies an optional container command and the environment variables to set:

bare – run directly in the local Python environment (no container)
lngs – Apptainer container on the LNGS cluster
sator – Apptainer container on the Sator cluster
nersc – Shifter container at NERSC

To add a new environment, add a new key under execenv following the same pattern.

Profiles¶

Snakemake execution profiles are stored in workflow/profiles/. Each profile is a directory containing a config.yaml that sets Snakemake options such as the number of cores and memory constraints.

The available profiles are:

default – bare-metal execution using all available cores
lngs – LNGS computing cluster settings
sator – Sator computing cluster settings
lngs-build-raw – settings specific to raw data building at LNGS

Specify a profile with the --profile flag:

snakemake --profile workflow/profiles/lngs all-l200-p03-r001-phy-skm.gen

A full list of configurable Snakemake options is available in the Snakemake CLI documentation.

Running the Dataflow¶

The $PRODENV environment variable must be set to the root of your production environment before running Snakemake:

export PRODENV=/path/to/your/production/environment
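If you launch Snakemake from a wrapper script, a pre-flight check can catch a missing variable early. The sketch below is an assumption about how such a check might look; require_prodenv is a hypothetical helper and is not provided by the dataflow.

```python
import os
from pathlib import Path


def require_prodenv(environ=os.environ) -> Path:
    """Return the PRODENV root, or raise if it is unset or not a directory."""
    root = environ.get("PRODENV")
    if not root:
        raise RuntimeError("PRODENV is not set; export it before running Snakemake")
    path = Path(root)
    if not path.is_dir():
        raise RuntimeError(f"PRODENV points to a missing directory: {path}")
    return path
```

Passing the environment as an argument keeps the check easy to exercise without modifying the real process environment.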

Single-file targets¶

At the most basic level you can ask Snakemake to build a single output file, and it will work out all the dependencies automatically:

snakemake /path/to/generated/tier/dsp/p03/r000/l200-p03-r000-cal-20230401T000000Z-tier_dsp.lh5

Batch targets with `.gen` files¶

In practice, you will want to process many files at once. The special .gen target format triggers processing of all matching files up to a given tier.

The target format is:

[all|valid]-{experiment}-{period}-{run}-{datatype}-{tier}.gen

where:

all / valid – process all data, or only data selected for analysis; any keyword in runlists.yaml in legend-datasets is a possible option
experiment – experiment name (e.g. l200)
period – data-taking period (e.g. p03)
run – run number (e.g. r001)
datatype – data type (e.g. phy for physics, cal for calibration)
tier – output tier to build up to (e.g. dsp, hit, skm)

Any component except tier can be replaced by a wildcard (*) to match all values, or a _-separated list to match multiple specific values.

Examples:

# Process all physics data from period p03, run r001, through to the SKM tier
snakemake all-l200-p03-r001-phy-skm.gen

# Process all calibration data from any run in period p03 to the DSP tier
snakemake all-l200-p03-*-cal-dsp.gen

# Process physics data from runs r000 and r001 to the HIT tier
snakemake all-l200-p03-r000_r001-phy-hit.gen

# Process analysis-selected physics data from any period and run to SKM
snakemake valid-l200-*-*-phy-skm.gen

On success, the empty marker file {label}-{tier}.gen is created to record that production completed successfully.
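The target naming scheme above can be sketched as a small parser. This is a hedged illustration of the format only, not the workflow's own code: GenTarget and parse_gen_target are hypothetical names, and the sketch assumes that no individual field contains a hyphen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenTarget:
    selection: str          # "all", "valid", or another runlists.yaml keyword
    experiment: list[str]
    period: list[str]
    run: list[str]
    datatype: list[str]
    tier: str


def parse_gen_target(target: str) -> GenTarget:
    """Split a `.gen` target name into its six components."""
    if not target.endswith(".gen"):
        raise ValueError(f"not a .gen target: {target!r}")
    parts = target[: -len(".gen")].split("-")
    if len(parts) != 6:
        raise ValueError(f"expected 6 '-'-separated fields, got {len(parts)}")
    selection, experiment, period, run, datatype, tier = parts

    def values(field: str) -> list[str]:
        # "_" separates multiple values; a lone "*" is kept to mean "match all".
        return field.split("_")

    return GenTarget(
        selection, values(experiment), values(period), values(run), values(datatype), tier
    )
```

For example, parse_gen_target("all-l200-p03-r000_r001-phy-hit.gen") yields two entries in the run field, while a wildcard field is returned as the single value "*".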

Post-processing¶

On successful completion, the workflow automatically:

Collects warnings and errors from individual log files into a summary log
Generates a Snakemake HTML report saved under the log directory
Builds pygama.flow.FileDB databases for the output files (if enabled)
Generates lists of valid file keys
Writes parameter validity catalog files for each tier

Monitoring¶

You can monitor the workflow with the snkmt TUI, which is launched with snkmt --console.

Software Containers¶

The dataflow uses container environments for reproducible execution on HPC systems. Rather than Snakemake’s built-in Singularity/Apptainer support, it manages containers through the execenv configuration, giving finer control over which commands are containerised.

Container settings are only required when Snakemake itself runs outside the container (e.g. when submitting jobs to a batch system). If the entire workflow runs inside the container, no special container configuration is needed.

Supported container runtimes:

Apptainer (formerly Singularity) – used at LNGS and Sator
Shifter – used at NERSC

The container image path and runtime command are configured per environment in the execenv section of dataflow-config.yaml.

Parameter Overrides¶

Calibration parameters can be overridden on a per-channel, per-run basis by placing override files in the directory specified by paths/par_overwrite in the configuration. These take precedence over parameters derived by the automatic calibration pipeline.

The override directory follows the same hierarchical structure as the parameter output directories.
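The precedence rule can be illustrated with a key-by-key merge. This sketch assumes that overrides are layered recursively on top of the derived parameters; the source does not state how the workflow actually combines them, and the function, channel, and parameter names below are made up.

```python
def apply_overrides(derived: dict, overrides: dict) -> dict:
    """Return a copy of `derived` with `overrides` layered on top.

    Nested dictionaries are merged key by key; any other value in
    `overrides` replaces the derived one.
    """
    merged = dict(derived)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# Made-up channel and parameter names, for illustration only.
derived = {"ch001": {"energy_scale": 1.02, "offset": 0.0}}
overrides = {"ch001": {"energy_scale": 1.05}}
result = apply_overrides(derived, overrides)
```

Here the overridden value replaces the derived one, while parameters that are not mentioned in the override are kept.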

Run Validity and Ignored Cycles¶

Two files in the detector_status directory control which data are included:

ignored_daq_cycles.yaml – lists DAQ cycles to exclude from processing entirely
run_override.yaml – allows overriding the validity window for specific runs, e.g. to apply a previous valid calibration to a subsequent run

These files are part of the legend-datasets repository.
