User Manual¶
Installation¶
Prerequisites¶
Python 3.11 or later
uv (or another virtual environment manager)
Access to the LEGEND metadata repository
Clone and install the package:
git clone https://github.com/legend-exp/legend-dataflow.git
cd legend-dataflow
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[dev]"
The [dev] extras include development tools such as testing and linting
dependencies. For production use, omit [dev].
Installing the software environment¶
All software versions are specified in the pyproject.toml file. Each entry can be either a released
version of a package (e.g. dbetto==1.2.4) or a local checkout (e.g. dbetto @ file:///${PROJECT_ROOT}/dbetto).
After configuring dataflow-config.yaml for your site (see Configuration
below), install the execution environment:
dataflow -v install -s <host> dataflow-config.yaml
where <host> is one of the execution environments defined in the config (e.g.
bare, lngs, sator, nersc). This installs all required software into
.snakemake/legend-dataflow/venv.
Note
If you update the software, clear the numba cache directory (defined in the config
under execenv.<host>.env.NUMBA_CACHE_DIR) to avoid stale compiled code.
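Clearing the cache amounts to deleting that directory. The following is a minimal sketch, not part of the dataflow CLI: clear_numba_cache is a hypothetical helper, and it assumes you pass it the same path that your config sets for NUMBA_CACHE_DIR.

```shell
# Hypothetical helper (not provided by legend-dataflow): remove a numba
# cache directory so that compiled code is regenerated on the next run.
clear_numba_cache() {
    cache_dir="$1"
    # Refuse to run without an explicit path, to avoid deleting the wrong place.
    if [ -z "$cache_dir" ]; then
        echo "no cache directory given" >&2
        return 1
    fi
    if [ -d "$cache_dir" ]; then
        rm -rf -- "$cache_dir"
        echo "removed $cache_dir"
    else
        echo "nothing to remove at $cache_dir"
    fi
}

# Usage (the path is whatever your config defines for NUMBA_CACHE_DIR):
#   clear_numba_cache "/path/to/numba/cache"
```

Passing the path explicitly, rather than reading it from the environment, keeps the helper from acting on a stale or unrelated NUMBA_CACHE_DIR exported in the current shell.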
Configuration¶
Data processing resources are configured via a single site-dependent YAML file,
conventionally named dataflow-config.yaml. The default file in the repository root
can serve as a starting point.
Key settings¶
| Parameter | Description |
|---|---|
| legend_metadata_version | Version tag of legend-metadata to check out automatically. Can be specified either as a tag (e.g. v1.2.0) or as a branch (e.g. main). |
| | If |
| | Whether to generate |
| | Whether to scan log files for errors/warnings after each run. |
| | Enable parallel processing within a single Snakemake job. |
| pht_intier | Which tier to use as input for pht; can be |
Paths¶
All path values support the $_ placeholder, which is substituted with the value of
the $PRODENV environment variable at runtime. This allows a single config file to
be used across different machines by setting PRODENV appropriately.
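To illustrate the substitution, here is a minimal sketch that mimics it in the shell. It is not the dataflow's own implementation: expand_prodenv is a hypothetical helper, the example path is made up, and replacing every occurrence of the placeholder is an assumption of this sketch.

```shell
# Hypothetical helper: expand the "$_" placeholder in a config path.
# Assumes PRODENV contains no '|', '&' or backslash characters, because
# its value is spliced into a sed replacement string.
expand_prodenv() {
    if [ -z "${PRODENV:-}" ]; then
        echo "PRODENV is not set" >&2
        return 1
    fi
    # \$_ in the sed pattern matches the literal two characters "$_".
    printf '%s\n' "$1" | sed "s|\\\$_|${PRODENV}|g"
}

# Single quotes keep the shell from expanding "$_" itself.
PRODENV=/path/to/your/production/environment
expand_prodenv '$_/generated/tier/dsp'
# prints /path/to/your/production/environment/generated/tier/dsp
```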
The key path categories are:
Input paths: metadata, config, par_overwrite, chan_map, detector_status, detector_db
Output tier paths: tier_raw, tier_tcm, tier_dsp, tier_hit, tier_psp, tier_pht, tier_ann, tier_evt, tier_pan, tier_pet, tier_skm
Parameter directories: par_raw, par_tcm, par_dsp, par_hit, par_psp, par_pht, par_pet
Scratch/log directories: tmp_plt, tmp_log, tmp_filelists, tmp_par, log, plt
Each generated path can point either to a location inside the current cycle, where new
files will be written, or to files from another cycle,
e.g. "tier_dsp": "/data2/public/prodenv/prod-blind/ref/v2.0.0", in which case those files
are used as input. The metadata version to use is set in this file under
legend_metadata_version and can be specified either as a tag (e.g. v1.2.0) or as a
branch (e.g. main). If you want to edit the metadata in your cycle, comment out this line.
Execution environments¶
The execenv section defines how scripts are executed. Each named environment
specifies an optional container command and the environment variables to set:
bare – run directly in the local Python environment (no container)
lngs – Apptainer container on the LNGS cluster
sator – Apptainer container on the Sator cluster
nersc – Shifter container at NERSC
To add a new environment, add a new key under execenv following the same pattern.
Profiles¶
Snakemake execution profiles are stored in workflow/profiles/. Each profile is a
directory containing a config.yaml that sets Snakemake options such as the number
of cores and memory constraints.
The available profiles are:
default – bare-metal execution using all available cores
lngs – LNGS computing cluster settings
sator – Sator computing cluster settings
lngs-build-raw – settings specific to raw data building at LNGS
Specify a profile with the --profile flag:
snakemake --profile workflow/profiles/lngs all-l200-p03-r001-phy-skm.gen
A full list of configurable Snakemake options is available in the Snakemake CLI documentation.
Running the Dataflow¶
The $PRODENV environment variable must be set to the root of your production
environment before running Snakemake:
export PRODENV=/path/to/your/production/environment
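Since every $_ path depends on this variable, it can help to check it before launching a long run. The guard below is a minimal sketch: require_prodenv is a hypothetical helper, and the requirement that PRODENV points to an existing directory is an assumption of this sketch rather than something the dataflow is documented to enforce.

```shell
# Hypothetical guard: fail early if PRODENV is unset or not a directory.
require_prodenv() {
    if [ -z "${PRODENV:-}" ]; then
        echo "PRODENV is not set" >&2
        return 1
    fi
    if [ ! -d "$PRODENV" ]; then
        echo "PRODENV does not point to a directory: $PRODENV" >&2
        return 1
    fi
}

# Usage: only start Snakemake when the check passes.
#   require_prodenv && snakemake --profile workflow/profiles/lngs all-l200-p03-r001-phy-skm.gen
```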
Single-file targets¶
At the most basic level you can ask Snakemake to build a single output file, and it will work out all the dependencies automatically:
snakemake /path/to/generated/tier/dsp/p03/r000/l200-p03-r000-cal-20230401T000000Z-tier_dsp.lh5
Batch targets with .gen files¶
In practice, you will want to process many files at once. The special .gen target
format triggers processing of all matching files up to a given tier.
The target format is:
[all|valid]-{experiment}-{period}-{run}-{datatype}-{tier}.gen
where:
- all/valid – process all data, or only data selected for analysis; any keyword
defined in runlists.yaml in legend-datasets is also a valid option
- experiment – experiment name (e.g. l200)
- period – data-taking period (e.g. p03)
- run – run number (e.g. r001)
- datatype – data type (e.g. phy for physics, cal for calibration)
- tier – output tier to build up to (e.g. dsp, hit, skm)
Any component except tier can be replaced by a wildcard (*) to match all
values, or a _-separated list to match multiple specific values.
Examples:
# Process all physics data from period p03, run r001, through to the SKM tier
snakemake all-l200-p03-r001-phy-skm.gen
# Process all calibration data from any run in period p03 to the DSP tier
snakemake all-l200-p03-*-cal-dsp.gen
# Process physics data from runs r000 and r001 to the HIT tier
snakemake all-l200-p03-r000_r001-phy-hit.gen
# Process analysis-selected physics data from any period and run to SKM
snakemake valid-l200-*-*-phy-skm.gen
On success, the empty marker file {label}-{tier}.gen is created to record
that production completed successfully.
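To make the target format above concrete, the sketch below splits a .gen target into its six fields. It only illustrates the naming scheme and is not how the workflow parses targets internally: parse_gen_target is a hypothetical helper, and it assumes that no field (including keywords from runlists.yaml) contains a dash.

```shell
# Hypothetical helper: split a .gen target into its named components.
parse_gen_target() {
    target="${1%.gen}"      # strip the .gen suffix
    old_ifs="$IFS"
    IFS='-'
    set -f                  # disable globbing so a '*' field stays literal
    set -- $target          # split on '-' into positional parameters
    set +f
    IFS="$old_ifs"
    if [ "$#" -ne 6 ]; then
        echo "expected 6 dash-separated fields, got $#" >&2
        return 1
    fi
    printf 'selection=%s experiment=%s period=%s run=%s datatype=%s tier=%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

parse_gen_target 'all-l200-p03-r000_r001-phy-hit.gen'
# prints selection=all experiment=l200 period=p03 run=r000_r001 datatype=phy tier=hit
```

A _-separated list such as r000_r001 is kept as a single field here; splitting it further into individual runs would be a second pass over that field.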
Post-processing¶
On successful completion, the workflow automatically:
Collects warnings and errors from individual log files into a summary log
Generates a Snakemake HTML report saved under the log directory
Builds pygama.flow.FileDB databases for the output files (if enabled)
Generates lists of valid file keys
Writes parameter validity catalog files for each tier
Monitoring¶
You can monitor a running workflow with the snkmt TUI, which is available via snkmt --console.
Software Containers¶
The dataflow uses container environments for reproducible execution on HPC systems.
Rather than Snakemake’s built-in Singularity/Apptainer support, it manages containers
through the execenv configuration, giving finer control over which commands are
containerised.
Container settings are only required when Snakemake itself runs outside the container (e.g. when submitting jobs to a batch system). If the entire workflow runs inside the container, no special container configuration is needed.
Supported container runtimes:
Apptainer (formerly Singularity) – used at LNGS and Sator
Shifter – used at NERSC
The container image path and runtime command are configured per environment in the
execenv section of dataflow-config.yaml.
Parameter Overrides¶
Calibration parameters can be overridden on a per-channel, per-run basis by placing
override files in the directory specified by paths/par_overwrite in the
configuration. These take precedence over parameters derived by the automatic
calibration pipeline.
The override directory follows the same hierarchical structure as the parameter output directories.
Run Validity and Ignored Cycles¶
Two files in the detector_status directory control which data are included:
ignored_daq_cycles.yaml – lists DAQ cycles to exclude from processing entirely
run_override.yaml – allows overriding the validity window for specific runs,
e.g. to apply a previous valid calibration to a subsequent run
These files are part of the legend-datasets repository.