Coherent set of packages, tools, and standards for future colliders
- Common effort from FCC, CLIC/ILC, EIC, CEPC, Muon Collider, …
- Preserves and adds onto existing functionality from iLCSoft, FCCSW, CEPCSW, …
- Builds on top of the experience from LHC experiments and results of targeted R&D (AIDA, …)
- Many institutes involved: CERN, DESY, IHEP, INFN, IJCLab, …
- Each project rebases its stack on top of Key4hep
- Having common building blocks enables synergies across collider communities
- Main ingredients:
  - Event data model: EDM4hep, based on PODIO (AIDA project)
  - Event processing framework: Gaudi, used in LHCb, ATLAS, …
  - Detector description: DD4hep (AIDA project)
  - System to build, test and deploy: Spack (suggested by HSF) + CVMFS
FCCAnalyses
FCCAnalyses — common analysis framework for FCC:
- ROOT RDataFrame based
- Provides analyzer functions/functors of varying complexity
  - Analysis is composed out of those
- Written in C++ and Python
  - Handles input dataset metadata
  - Manages running of the dataframe: locally or on HTCondor
- Event data can be read in directly by RDF or through RDataSource
- Accompanying infrastructure of FCCAnalyses:
  - Getting help
  - Example Higgs recoil analysis
Where to get FCCAnalyses
- Latest tag: v0.12.0, released two weeks ago
- FCCAnalyses is distributed in the Key4hep stack:
  - Key4hep Release: source /cvmfs/sw.hsf.org/key4hep/setup.sh
  - Key4hep Nightlies: source /cvmfs/sw-nightlies.hsf.org/key4hep/setup.sh
- Developing FCCAnalyses:
  - git clone git@github.com:HEP-FCC/FCCAnalyses.git
  - source setup.sh -h
- Specialized version for the winter2023 datasets can be obtained only from GitHub:
  - git clone --branch pre-edm4hep1 git@github.com:HEP-FCC/FCCAnalyses.git
  - Works with a fixed Key4hep stack: 2024-03-10
What is an Analyzer
Set of self-contained functions/functors operating on ROOT dataframes
- Many functions/functors to run on dataframe columns are provided from outside of FCCAnalyses (usually low level)
- More specialized ones are provided in the FCCAnalyses standard library
  - A lot is still missing due to the many input/output objects and their combinations
  - dNdx analyzers now properly use edm4hep::RecDqdxData
- Current analyzers depend on the following C++ frameworks:
  - ROOT — together with RDataFrame
  - ACTS — track reconstruction tools (not fully supported)
  - ONNX — neural network exchange format
  - FastJet — jet finding package
  - DD4hep — detector description
  - Delphes — fast simulations
- Recent improvements in the ROOT Python interface
- Fork model of FCCAnalyses creates many copies of the analyzers, which are shared among different groups
- The operations on the dataframe happen either with small stateless functions or with structs that carry internal state
How to write an analysis
Analysis encapsulated into a class
- The run stage can be encapsulated into a class: Analysis
  - Possibility to provide parameters from outside
  - Avoids evaluating parts of the analysis script at the bootstrap stage
- Histmaker style analyses can also receive CLI arguments
- Many attributes are documented in the fccanalysis-script, fccanalysis-final-script and fccanalysis-plots-script manual pages
- Additional analyzers can be provided in a .hxx header file
  - The analyzers will get JIT compiled at startup
  - The old extension method involving additional compilation with CMake is deprecated
How to run an analysis
Analysis behavior depends on multiple categories of command-line arguments.
- The anatomy of the fccanalysis command line interface:
  fccanalysis <global-args> <sub-command> <sub-command-args> <analysis-script> -- <script-args>
- Example:
  fccanalysis -vv run --n-threads 4 my_fcc_analysis.py -- --pt-min 40
- The -- (double dash) is required to properly pass the user arguments to the analysis script
- The verbosity is a global argument and is synchronized with ROOT RDataFrame
- Promising benchmarks show recovery of performance when using podio::DataSource
  - Requires work on clever retrieval of collections
- TupleWriter under development: k4FWCore#364
  - Base algorithm to create NTuples
Where to run an analysis
Making FCCAnalyses distributed
- FCCAnalyses has supported HTCondor submission for a long time:
  - fccanalysis submit ana-script.py
- Easier running on local files with a file list:
  - New -i, --input and -f, --input-file-list arguments
- Sample location can be specified individually with the input-dir parameter in the Process list
- Running on centralized samples will get reworked
- Users can submit FCCAnalyses to DIRAC
  - Example workflow shows how to submit jobs with winter2023-compatible FCCAnalyses
  - One sample per submission
  - Based on the DIRAC General Application
Other approaches
Exploring other languages and software ecosystems
- podio allows writing an analysis in C++ and Python
- RDataFrame in C++, with or without analyzers from FCCAnalyses
- Coffea — awkward-array based columnar analysis framework
- Julia: high-level, general-purpose dynamic programming language
Future Plans / Challenges
Focus on the flagship analyses important for the benchmarking of the detector designs
- Support processing of RNTuple files
- Expand distributed computing options
  - Allow running on GRID, Slurm or other platforms
- Better ML support
  - Streamline the current implementation
  - Investigate/adopt ROOT solutions, e.g. TMVA SOFIE
- Join the developments in Key4hep
- Work on overcoming software ecosystem silos
  - Key4hep, PyHEP, DiracOS, Rucio, …
  - Invest time in tooling/approaches which can bridge the gap
- Streamline the NTuple production mechanism and improve the performance of directly handling EDM4hep objects
- Make the framework more robust and expand its capabilities
  - Merge staged and histmaker styles together
  - Bring CMS Combine to the Key4hep stack
  - Revive the plotting capabilities
Brace yourselves for more breaking changes!
Trinity of Dataset Production Systems
Designing the dataset production system for the pre-TDR era and beyond
- Traditionally, centralized dataset productions are handled with:
  - Workload manager — DIRAC/iLCDirac
  - Dataset manager — DIRAC/iLCDirac or Rucio
  - Metadata manager — FCC Physics Events
- DIRAC dataset management is used by LHCb
- Investigations of the capabilities of Rucio are underway
- Capabilities of FCC Physics Events to be expanded:
  - Support for API calls
  - Full dataset provenance
- Crucial ingredient: How to address the datasets?
- At the moment we use:
  - LFNs (Logical File Names) in DIRAC, e.g.: fcc/ee/test_spring2024/240gev/Hbb/CLD_o2_v07/rec/
  - Collection of 5 tags in EventProducer, e.g.: accelerator, campaign, stage, detector, process
FCC VO
Resources available in the FCC Virtual Organization
DIRAC Resources Overview
Rucio Architecture Overview
- FCC resources administered by the FCC VO
  - Registered at EGI
  - Can be used by individual users and for centralized dataset production
  - Support e-group: fcc-vo-support@cern.ch
- Resources in the VO:
  - Four fully operational Storage Elements
  - Only one Compute Element in the VO
    - More in various stages of readiness
  - Resource pledges are not rigidly formalized yet
- Continuing to exercise the system in a test campaign
  - Full campaign will be launched soon
  - Production configuration kept in the FCCDIRAC repository
  - Successfully tested job result storage on the operational SEs
  - Testing replications done through DIRAC to CNAF-DISK and BARI-DISK
  - Rucio pilot replications underway
DIRAC / iLCDirac @ FCC
Interware to exploit distributed heterogeneous resources
- Created by LHCb, these days used by Belle2, ILC, CTAO, CEPC, …
- High-level interface to interact with GRIDs, clouds, HPCs and batch systems
- iLCDirac: set of extensions/applications developed for the linear collider studies
- Offers data/metadata management
  - File transfers/replications, metadata-augmented file catalog
- Has a web interface to control aspects of the system
  - Transformations, jobs, accounting, system administration
- New developments happen under the DiracX project
- Software needs to be wrapped into a DIRAC application
  - Many already available thanks to iLCDirac
  - More to be added to support users/productions of FCC
    - MC generators, interfaces for Key4hep based applications, …
- See talk by A. Sailer
Rucio
Scientific data management
- Rucio provides a system to manage experiment datasets across distributed resources
- Initially developed in ATLAS, now used by a number of large and small collaborations
  - Has shown the ability to scale to very large pools of data
- Organizes data in a hierarchy of namespaces, containers, datasets and files
- Placement of the data depends on the replication rules
  - Describing storage and time limitations
- Interfacing between DIRAC and Rucio already done by Belle2
  - FCC Rucio instance managed by CERN IT
- See talk by G. Guerrieri
Centrally produced datasets
Current situation with already produced datasets
- Datasets include generator-level files and parametrized simulation for FCC-ee and FCC-hh
- Stored on EOS at: /eos/experiment/fcc/<accelerator-type>/generation/
- Detailed information about EventProducer datasets published on the FCC Physics Events website
- FCCAnalyses framework automatically picks up the sample information from YAML and JSON files
  - The interface is under overhaul, moving towards API call(s)
- Several FCC-hh FullSim and Delphes campaigns moved to tape
  - Sample metadata kept on the website for historical record
FCC Physics Events
Evolving FCC Physics Events into a Metadata Manager
- New FCC Physics Events website based on modern web technologies
- Can ingest a multitude of datatypes
  - Core categories are columns in the database; the rest of the metadata is stored as a JSON string
- Already provides an API endpoint serving dataset metadata as a JSON dictionary
- See talk by M. Cechovic
- Plan to implement dataset provenance
  - Open to switching to a more robust tool down the line
Flare
FCCee b2Luigi Automated Reconstruction and Event processing
- Framework allowing users to organize their event processing chains into dynamic workflows based on b2luigi
- Handles scheduling of complex workflows well
- Orchestration happens at the level of executables
- Integrated into the LCG based Key4hep stack
- See talk by C. Harris
Conclusions
- The FCCAnalyses framework evolves from the needs of the FSR case studies
- Interface-breaking changes are being introduced gradually
- FullSim flagship analyses will require a multitude of adjustments: analyzers, distributed computing, metadata access, …
- Integrate, or lower the threshold for interfacing, tooling from outside Key4hep
- Looking forward to EDM4hep 1.0 — a new campaign of centrally produced samples
- Utilizing the full potential of DIRAC/iLCDirac
- Understanding data management with Rucio
- Crucial question: What to generate?
  - Input on what datasets you would like to get produced is needed!