FCCAnalyses and Distributed Computing

Juraj Smieško (CERN)

9th FCC Physics Workshop

Munich, DE

29 January 2026

FCCAnalyses

Key4hep

Coherent set of packages, tools, and standards for future colliders

  • Common effort from FCC, CLIC/ILC, EIC, CEPC, Muon Collider, …
    • Preserves and adds onto existing functionality from iLCSoft, FCCSW, CEPCSW, …
    • Builds on top of the experience from LHC experiments and results of targeted R&D (AIDA, …)
    • Many institutes involved: CERN, DESY, IHEP, INFN, IJCLab, …
  • Each project rebases its stack on top of Key4hep
  • Having common building blocks enables synergies across collider communities
  • Main ingredients:
    • Event data model: EDM4hep, based on PODIO, AIDA project
    • Event processing framework: Gaudi, used in LHCb, ATLAS, …
    • Detector description: DD4hep, AIDA project
    • System to build, test and deploy: Spack, suggested by HSF + CVMFS
Key4hep logo
Key4hep design

FCCAnalyses

  • FCCAnalyses — common analysis framework for FCC:
    • ROOT RDataFrame based
    • Provides analyzer functions/functors of varying complexity
      • An analysis is composed of these
    • Written in C++ and Python
    • Handles input dataset metadata
    • Manages running of the dataframe: locally or on HTCondor
    • Event data can be read directly by RDF or through an RDataSource
  • Accompanying infrastructure of FCCAnalyses
  • Getting help
Analysis graph
Example Higgs recoil analysis

Where to get FCCAnalyses

  • Latest tag: v0.12.0 was released two weeks ago
  • FCCAnalyses is distributed in the Key4hep stack:
    • Key4hep Release:
      source /cvmfs/sw.hsf.org/key4hep/setup.sh
    • Key4hep Nightlies:
      source /cvmfs/sw-nightlies.hsf.org/key4hep/setup.sh
  • Developing FCCAnalyses:
    • git clone git@github.com:HEP-FCC/FCCAnalyses.git
    • source setup.sh -h
  • Specialized version for winter2023 datasets can be obtained only from GitHub
    • git clone --branch pre-edm4hep1 git@github.com:HEP-FCC/FCCAnalyses.git
    • Works with fixed Key4hep stack: 2024-03-10

What is an Analyzer

Set of self-contained functions/functors operating on a ROOT dataframe

  • Many functions/functors running on dataframe columns are provided from outside of FCCAnalyses (usually low-level ones)
  • More specialized ones provided in FCCAnalyses standard library
    • Many are still missing due to the large number of input/output objects and their combinations
    • dNdx analyzers now properly use edm4hep::RecDqdxData
  • Current analyzers depend on the following C++ frameworks
    • ROOT — together with RDataFrame
    • ACTS — track reconstruction tools (not fully supported)
    • ONNX — neural network exchange format
    • FastJet — jet finding package
    • DD4hep — detector description
    • Delphes — fast simulations
  • Recent improvements in the ROOT Python interface
  • The fork model of FCCAnalyses creates many copies of the analyzers, which end up scattered among different groups
    • Upstream your functions/functors!
  • The operations on the dataframe happen either with small stateless functions or with structs, which carry internal state

How to write an analysis

Analysis encapsulated into a class

  • The run stage can be encapsulated into a class: Analysis
    • Parameters can be provided from the outside
    • Avoids evaluating parts of the analysis script at the bootstrap stage
  • Histmaker style analysis can also receive CLI arguments
  • Many attributes are being documented in
    fccanalysis-script, fccanalysis-final-script and fccanalysis-plots-script manual pages
  • Additional analyzers can be provided in a .hxx header file
    • The analyzers will be JIT-compiled at startup
    • The old extension method, which required additional compilation with CMake, is deprecated
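A hedged skeleton of what such an Analysis class can look like (attribute and method names here are illustrative; the fccanalysis-script manual page documents the authoritative interface):

```python
# Indicative skeleton of a run-stage analysis script; attribute and method
# names are illustrative, consult the fccanalysis-script manual page.
class Analysis:
    def __init__(self, cmdline_args):
        # User parameters can arrive from the CLI (after the double dash)
        self.pt_min = getattr(cmdline_args, 'pt_min', 40.0)
        # Which samples to run over (process names are hypothetical)
        self.process_list = {'p8_ee_ZH_ecm240': {}}
        self.n_threads = 4

    def analyzers(self, dframe):
        # Build the analysis graph; nothing is evaluated at bootstrap.
        # The analyzer name below is illustrative.
        return dframe.Define(
            'sel_muons',
            f'ReconstructedParticle::sel_pt({self.pt_min})(Muon)')

    def output(self):
        # Branches written to the output NTuple
        return ['sel_muons']
```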

How to run an analysis

Analysis behavior depends on multiple categories of command-line arguments.

  • The anatomy of the fccanalysis command line interface:
    fccanalysis <global-args> <sub-command> <sub-command-args> <analysis-script> -- <script-args>

    Example:
    fccanalysis -vv run --n-threads 4 my_fcc_analysis.py -- --pt-min 40

    -- (double dash) is required to pass the user arguments through to the analysis script
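On the script side, one way to consume the arguments after the double dash is plain argparse; a minimal, self-contained sketch (the --pt-min option mirrors the example above):

```python
import argparse

def parse_script_args(argv):
    """Parse the user arguments handed to the analysis script after `--`."""
    parser = argparse.ArgumentParser(description='my_fcc_analysis options')
    parser.add_argument('--pt-min', type=float, default=20.0,
                        help='minimal transverse momentum [GeV]')
    return parser.parse_args(argv)

# Mirrors: fccanalysis -vv run ... my_fcc_analysis.py -- --pt-min 40
args = parse_script_args(['--pt-min', '40'])
```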

  • The verbosity is a global argument and is synchronized with ROOT RDataFrame
  • Promising benchmarks show a recovery of performance when using podio::DataSource
    • Requires work on clever retrieval of collections
  • TupleWriter under development: k4FWCore#364
    • Base algorithm to create NTuples

Where to run an analysis

Making FCCAnalyses distributed

HTCondor logo
DIRAC Logo
iLCDirac Logo
  • FCCAnalyses has supported HTCondor submission for a long time
    • fccanalysis submit ana-script.py
  • Easier running on local files with a file list
    • New -i, --input and -f, --input-file-list arguments
    • Sample location can be specified individually with input-dir parameter in the Process list
    • Running on centralized samples will be reworked
  • Users can submit FCCAnalyses to DIRAC
    • Example workflow shows how to submit jobs with winter2023 compatible FCCAnalyses
    • One sample per submission
    • Based on DIRAC General Application

Other approaches

Exploring other languages and software ecosystems

  • podio allows writing an analysis in C++ and Python
  • RDataFrame in C++, with or without analyzers from FCCAnalyses
  • Coffea — awkward-arrays based columnar analysis framework
  • Julia: high-level, general-purpose dynamic programming language
Coffea logo Julia HEP logo

Future Plans / Challenges

Focus on the flagship analyses important for the benchmarking of the detector designs

  • Support processing of RNTuple files
  • Expand distributed computing options
    • Allow running on the GRID, Slurm, or other platforms
  • Better ML support
    • Streamline current implementation
    • Investigate/adopt ROOT solutions e.g. TMVA SOFIE
    • Join the developments in Key4hep
  • Work on overcoming software ecosystem silos
    • Key4hep, PyHEP, DiracOS, Rucio, …
    • Streamline NTuple production mechanism and improve performance of directly handling EDM4hep objects
    • Invest time in tooling/approaches which can bridge the gap
  • Make the framework more robust and expand its capabilities
    • Merge staged and histmaker styles together
    • Bring CMS Combine to the Key4hep stack
    • Revive the plotting capabilities
Brace yourselves for more breaking changes!

Distributed Computing

Trinity of Dataset Production Systems

Designing the dataset production system for the pre-TDR era and beyond

Centralized dataset production system
  • Traditionally the centralized dataset productions are handled with
    • Workload manager — DIRAC/iLCDirac
    • Dataset manager — DIRAC/iLCDirac or Rucio
    • Metadata manager — FCC Physics Events
  • DIRAC dataset management is used by LHCb
  • Investigations of the capabilities of Rucio underway
  • Capabilities of FCC Physics Events to be expanded
    • Support for API calls
    • Full dataset provenance
  • Crucial ingredient: How to address the datasets?
    • At the moment we use:
      • LFNs (Logical File names) in DIRAC, e.g.:
        fcc/ee/test_spring2024/240gev/Hbb/CLD_o2_v07/rec/
      • Collection of 5 tags in EventProducer, e.g.:
        accelerator, campaign, stage, detector, process
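The two addressing schemes can be related mechanically; a hypothetical sketch mapping the five tags onto an LFN-like path (the layout is illustrative, not the production convention, which also encodes extra elements such as the centre-of-mass energy):

```python
def tags_to_path(accelerator, campaign, process, detector, stage):
    """Join the five EventProducer tags into an LFN-like path.

    Hypothetical layout for illustration; the real LFN convention differs.
    """
    return '/'.join(('fcc', accelerator, campaign, process, detector, stage))

path = tags_to_path('ee', 'test_spring2024', 'Hbb', 'CLD_o2_v07', 'rec')
```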

FCC VO

Resources available in the FCC Virtual Organization

DIRAC Overview
DIRAC Resources Overview
Rucio Overview
Rucio Architecture Overview
  • FCC resources are administered by the FCC VO
    • Registered at EGI
    • Can be used by individual users and for centralized dataset production
    • Support e-group: fcc-vo-support@cern.ch
  • Resources in the VO:
    • Four fully operational Storage Elements
    • Only one Compute Element in the VO
    • More in various stages of readiness
    • Resource pledges are not rigidly formalized yet
  • Continuing to exercise the system in a test campaign
    • Full campaign will be launched soon
    • Productions configuration kept in FCCDIRAC repository
    • Successfully tested job result storage on the operational SEs
    • Testing replications done through DIRAC to CNAF-DISK and BARI-DISK
    • Rucio pilot replications underway

DIRAC / iLCDirac @ FCC

Interware to exploit distributed heterogeneous resources

DIRAC Logo iLCDirac Logo
DIRAC Applications
  • Created by LHCb, nowadays used by Belle II, ILC, CTAO, CEPC, …
  • High level interface to interact with GRIDs, Clouds, HPCs and Batch systems
  • iLCDirac: set of extensions / applications developed for the linear collider studies
  • Offers data / metadata management
    • File transfers / replications, metadata augmented file catalog
  • Has a web interface to control aspects of the system:
      transformations, jobs, accounting, system administration
  • New developments happen under the DiracX project
  • Software needs to be wrapped into a DIRAC application
    • Many already available thanks to iLCDirac
    • More to be added to support users/productions of FCC
      • MC generators, interfaces for Key4hep based applications, ...
  • See talk by A. Sailer
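A minimal sketch of submitting such a wrapped application through the DIRAC Python API (assuming a configured DIRAC client and a valid proxy; the wrapper script name run_analysis.sh is hypothetical, and this fragment is not runnable standalone):

```python
# Fragment: assumes a configured DIRAC client and a valid grid proxy.
from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName('fccanalysis-test')
# Hypothetical wrapper that sources the Key4hep stack and
# calls `fccanalysis run ...`
job.setExecutable('run_analysis.sh')

result = Dirac().submitJob(job)
print(result)
```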

Rucio

Scientific data management

  • Rucio provides a system to manage experiment datasets across distributed resources
  • Initially developed in ATLAS, now used by a number of large and small collaborations
    • Showing the ability to scale to very large pools of data
  • Organizes data in a hierarchy of namespaces, containers, datasets and files
  • Placement of the data depends on the replication rules
    • These describe storage and time limitations
  • Interfacing between DIRAC and Rucio has already been done by Belle II
  • FCC Rucio instance managed by the CERN IT
  • See talk by G. Guerrieri
Rucio logo

Centrally produced datasets

Current situation with already produced datasets

  • Datasets include generator level files and parametrized simulation for FCC-ee and FCC-hh
  • Stored on EOS at:
    /eos/experiment/fcc/<accelerator-type>/generation/
  • Detailed information about EventProducer datasets published on FCC Physics Events website
  • FCCAnalyses framework automatically picks up the sample information from YAML and JSON files
      The interface is under overhaul, moving towards API call(s)
  • Several FCC-hh Fullsim and Delphes campaigns moved to tape
    • Sample metadata kept on the website for historical record
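As an illustration of what automatic sample pickup can look like, a self-contained sketch that reads per-sample metadata from a JSON dictionary (the schema, numbers, and sample name are invented for the example; they are not the FCC Physics Events schema):

```python
import json

# Invented metadata schema, for illustration only.
raw = '''{"p8_ee_ZH_ecm240": {"numberOfEvents": 1000000,
                              "crossSection": 0.201,
                              "path": "/eos/experiment/fcc/ee/generation/..."}}'''

samples = json.loads(raw)
meta = samples['p8_ee_ZH_ecm240']
# Per-event weight for normalizing the sample to its cross section
weight = meta['crossSection'] / meta['numberOfEvents']
```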

FCC Physics Events

Evolving FCC Physics Events into Metadata Manager

  • New FCC Physics Events Website based on modern web technologies
  • Can ingest a multitude of datatypes
  • Core categories are columns in the database; the rest of the metadata is stored as a JSON string
  • Already provides an API endpoint serving dataset metadata as a JSON dictionary
  • See talk by M. Cechovic
  • Plan to implement dataset provenance
    • Open to switching to a more robust tool down the line

Flare

FCCee b2Luigi Automated Reconstruction and Event processing

  • Framework allowing users to organize their event processing chains into dynamic workflows, based on b2luigi
  • Handles scheduling of complex workflows well
  • Orchestration happens at the level of executables
  • Integrated into the LCG based Key4hep stack
  • See talk by C. Harris
b2luigi logo

Conclusions

  • The FCCAnalyses framework evolves with the needs of the FSR case studies
    • Interface-breaking changes are being gradually introduced
    • FullSim flagship analyses will require a multitude of adjustments: analyzers, distributed computing, metadata access, …
    • Integrate tooling from outside Key4hep, or lower the threshold for interfacing with it
  • Looking forward to EDM4hep 1.0 — new campaign of centrally produced samples
    • Utilizing full potential of DIRAC/iLCDirac
    • Understanding the data management with Rucio
  • Crucial question: What to generate?
    • Input on which datasets you would like to get produced is needed!