Analysis and Visualization

Juraj Smieško (CERN)

FCC Week 2025

Vienna, AT

21 May 2025

Content

  • Analysis technologies
    • FCCAnalyses: status and plans
    • Other analysis approaches
  • Centralized dataset productions
    • Event Producer
    • Status of DIRAC based productions
  • Visualization

Analysis technologies

FCCAnalyses Overview

Analysis framework inside the Key4hep ecosystem, built
on top of the ROOT RDataFrame

FCCAnalyses place in the event reconstruction chain.
  • Provides standard library of functions
    • Many C++ HEP frameworks available
  • Automatic retrieval of metadata for the centrally produced datasets
    • Including the location and selection of the input files
  • Runs the ROOT RDataFrame
    • Local or remote execution
  • Helps with histograms/plots
    • Export of results to other tools
  • Registry for the analyses
    • Dedicated place for all case-studies

Versions of FCCAnalyses

Or how to get and run FCCAnalyses

  • Latest version of FCCAnalyses can be obtained from:
    • GitHub:
      git clone git@github.com:HEP-FCC/FCCAnalyses.git
    • Key4hep nightlies stack:
      source /cvmfs/sw-nightlies.hsf.org/key4hep/setup.sh
  • Latest released version of FCCAnalyses is v0.10.0 and can be obtained from:
    • GitHub:
      git clone --branch v0.10.0 git@github.com:HEP-FCC/FCCAnalyses.git
    • Key4hep release stack:
      source /cvmfs/sw.hsf.org/key4hep/setup.sh
  • Specialized version for winter2023 samples can be obtained only from:
    • GitHub:
      git clone --branch pre-edm4hep1 git@github.com:HEP-FCC/FCCAnalyses.git
  • Recommendation:
    FCCAnalyses can be run without compiling — fccanalysis command is part of the Key4hep stack

Standard library of analyzers

Set of self-contained functions/functors operating on ROOT dataframe

  • Many functions/functors that run on dataframe columns are provided from outside of FCCAnalyses (usually low-level ones)
  • FCCAnalyses provides more specialized ones in its standard library
      Many are still missing because of the number of input/output objects and their combinations
  • Analyzers can depend on the following C++ frameworks:
    • ROOT — together with RDataFrame
    • ACTS — track reconstruction tools (not fully supported)
    • ONNX — neural network exchange format
    • FastJet — jet finding package
    • DD4hep — detector description
    • Delphes — fast simulations
  • Fork model of FCCAnalyses creates many copies of the analyzers, which are shared among different groups
    • Call: Upstream your functions/functors
  • The operations on the dataframe happen either with small stateless functions
    or with structs (functors), which have internal state
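The two styles can be sketched as follows. This is only an illustration written in Python: actual FCCAnalyses analyzers are C++ functions and functors, and the names used here (`transverse_momentum`, `SelectAbovePt`) are hypothetical and not part of the standard library of analyzers.

```python
import math
from dataclasses import dataclass


def transverse_momentum(px: float, py: float) -> float:
    """Stateless analyzer: the result depends only on the arguments."""
    return math.hypot(px, py)


@dataclass(frozen=True)
class SelectAbovePt:
    """Stateful analyzer: the threshold is fixed once, at construction."""

    min_pt: float

    def __call__(self, pts: list[float]) -> list[float]:
        # Keep only the values above the configured threshold.
        return [pt for pt in pts if pt > self.min_pt]


pts = [transverse_momentum(3.0, 4.0), transverse_momentum(30.0, 40.0)]
selected = SelectAbovePt(min_pt=10.0)(pts)
```

The functor form is convenient when the same selection is needed with different thresholds, since each configured instance can be attached to its own dataframe column.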

Towards better defined analysis script

Analysis encapsulated into a class

  • In both styles (staged or histmaker), various attributes adjust how the analysis script is run
  • It is not clear which attributes are needed and when
  • The attributes are being documented in
    fccanalysis-script, fccanalysis-final-script and fccanalysis-plots-script manual pages
  • In order to better define the script interface all attributes are being moved into the analysis class
    • Not yet done for final and plots stages
  • This is a breaking change
    • For now the old style of analysis is still supported
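A class-based analysis script might look roughly like the sketch below. It is an illustration under assumptions, not the authoritative FCCAnalyses interface: the attribute names (`process_list`, `output_dir`, `n_threads`), the method names (`analyzers`, `output`), the sample name and the `FakeFrame` stand-in for an RDataFrame are all hypothetical. The manual pages remain the reference for the real attributes.

```python
class FakeFrame:
    """Stand-in for an RDataFrame so that the sketch runs without ROOT."""

    def __init__(self, columns=()):
        self.columns = tuple(columns)

    def Define(self, name, expression):
        # Mimic RDataFrame.Define by returning a new frame with one more
        # column; the expression string is never evaluated in this sketch.
        return FakeFrame(self.columns + (name,))


class Analysis:
    """All settings live on the instance instead of in module-level globals."""

    def __init__(self, cmdline_args):
        # Hypothetical attributes steering the run.
        self.process_list = {"example_sample": {"fraction": 0.1}}
        self.output_dir = "outputs/example"
        self.n_threads = 4
        self.cmdline_args = cmdline_args

    def analyzers(self, dframe):
        # Book the dataframe operations and return the resulting frame.
        return dframe.Define("n_muons", "Muon.size()")

    def output(self):
        # Columns to be written out.
        return ["n_muons"]


analysis = Analysis(cmdline_args={})
frame = analysis.analyzers(FakeFrame())
```

Keeping the attributes on the class makes the script interface explicit: everything the framework reads is either an attribute or a method of one object.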

Analysis CLI arguments

Analysis script can use parameters provided from the command-line.

  • Command line arguments are fed into the Analysis class
  • They need to be parsed by the script itself
  • All arguments provided after -- (double dash) are considered to belong to the script

The anatomy of the fccanalysis command line interface:
fccanalysis <global-args> <sub-command> <sub-command-args> <analysis-script> -- <script-args>

Example:
fccanalysis -vv run --n-threads 4 my_fcc_analysis.py -- --pt-min 40

-- (double dash) introduced in PR#422
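
How a script could separate and parse its own arguments can be sketched with the Python standard library. The helper `split_at_double_dash`, the default value of `--pt-min` and its help text are assumptions made for illustration; this is not the FCCAnalyses implementation.

```python
import argparse


def split_at_double_dash(argv):
    """Return (framework_args, script_args), split at the first '--'."""
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1:]
    return argv, []


def parse_script_args(script_args):
    """Parse the arguments that belong to the analysis script."""
    parser = argparse.ArgumentParser(description="Script-level options")
    # The default value is hypothetical.
    parser.add_argument("--pt-min", type=float, default=20.0,
                        help="minimal transverse momentum")
    return parser.parse_args(script_args)


argv = ["-vv", "run", "--n-threads", "4", "my_fcc_analysis.py",
        "--", "--pt-min", "40"]
framework_args, script_args = split_at_double_dash(argv)
options = parse_script_args(script_args)
```

With the example command line above, everything before the double dash stays with the framework and only `--pt-min 40` reaches the script parser.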

Submit sub-command

Decoupling submission machinery from analysis execution

  • Allows for forms of distributed computing other than HTCondor
  • Also improves the current HTCondor submission machinery
  • Will allow running of the Histmaker style analyses on HTCondor
  • Usage: fccanalysis submit ana_script.py
  • Merged recently: PR#422
Table: Supported ways to submit to HTCondor, by where the analysis is hosted (AFS or EOS) and where the output is written (AFS or EOS)

FCCAnalyses Plans

  • Focus on distributed computing
    • Allow running on GRID, Slurm or other platforms
  • Make FCCAnalyses able to run on non-LXPlus machines
  • Better ML support
      Depending not only on the ROOT capabilities
      • A master's student will start investigations soon
  • Expand plotting capabilities
    • Many products in the Python HEP space
    • Integrate existing tools and/or make FCCAnalyses more interoperable
  • Improve performance of EDM4hep relationship handling
    • podio::DataSource performance affected by podio/EDM4hep data alignment
  • Streamline NTuple production mechanism

Other analysis approaches

Python and Julia based analysis approaches

  • podio Python bindings
  • Coffea — awkward-arrays based columnar analysis framework
  • Julia: high-level, general-purpose dynamic programming language
  • C++: Raw podio/EDM4hep, RDataFrame with analyzers from FCCAnalyses

Centralized dataset productions

Centrally produced datasets

Bookkeeping of centrally produced datasets

  • Datasets include generator level files, parametrized simulation and Fullsim samples for FCC-ee and FCC-hh
  • Stored on EOS at:
    /eos/experiment/fcc/<accelerator-type>/generation/
    /eos/experiment/fcc/prod/fcc/<accelerator-type>/
  • Detailed information about the samples published on FCC Physics Events website
    • Storage backend now uses an SQL database instead of plain JSON files
    • Make dataset database searchable — 2025 Summer Student Project
  • FCCAnalyses framework automatically picks up the sample information from YAML and JSON files
      The interface is under overhaul to allow the information to be consumed by other analysis solutions as well
  • Old FCC-hh Fullsim and Delphes samples moved to tape
    • Sample metadata kept on the website for historical record

DIRAC / iLCDirac

Interware to exploit distributed heterogeneous resources

DIRAC Overview
  • Created by LHCb, these days used by Belle II, ILC, CTAO, CEPC, …
  • High level interface to interact with Grids, Clouds, HPCs and Batch systems
  • Data / Metadata management
    • File transfers / replication, metadata augmented file catalog
  • Web interface to control aspects of the system
      Transformations, jobs, accounting, system administration
  • Resources organized into virtual organizations (VOs)
  • iLCDirac: set of extensions / applications developed for the linear collider studies

DIRAC / iLCDirac @ FCC

Resources available in the FCC Virtual Organization

  • FCC resources administered by FCC VO
    • Could be used by individual users and for centralized dataset preparation
    • Support e-group: fcc-vo-support@cern.ch
  • Storage Elements attached to FCC VO:
    • GLASGOW-DISK: Operational
    • CNAF-DISK: Operational
    • BARI-DISK: In progress
  • Started to exercise the system to produce a test campaign
    • Full campaign will be launched after EDM4hep 1.0 is finalized
    • Configuration kept in FCCDIRAC repository
  • Software needs to be wrapped into a DIRAC application
    • Many available thanks to iLCDirac
    • More added to support users/productions of FCC
      • More MC generators, interfaces for Key4hep based applications
  • Will need to come up with the organizational structure for the production of the datasets
DIRAC Applications
How to use iLCDirac summarized by A. Sailer in: Monte Carlo Productions for Full Simulation Studies
(The third ECFA workshop on e+e- Higgs, Electroweak and Top Factories)

Visualization

Visualization Overview

Tools living in or cooperating with the Key4hep ecosystem

ARC detector proposal visualized by Geant4
source: A. Tolosa-Delgado
JSROOT visualization of layered polyhedron builder
source: M. Ali
  • Part of an established framework
  • For global overview
    • CED: OpenGL based, originates in iLCSoft
    • Phoenix: an experiment independent web-based event display
  • Specialized use cases
    • calodisplay: purpose-built visualization tool
    • eede: EDM4hep event data explorer
    • PandoraMonitoring: ROOT based visualization environment for Pandora
    • Computational graph of the analysis, example:
      • fccanalysis run -r ana_script.py
Phoenix@FCC
eede

Visualization status

Need for more specialized features tailored for MC, tracking, calorimetry, …

  • Web based geometry visualization
    • Web tools usually offer the most hassle-free experience
    • Good for the overview, lack of specialization
      • Expansion of current capabilities to cover common specialized tasks: MC, tracking, calorimetry, ...
    • Improve geometry conversion and manipulation
      • Missing user configurability
  • Datamodel visualization
    • Expansion of supported collections
    • Embeddability can extend the number of use cases
  • Analysis visualization
    • Only basic tree visualization
    • Missing configurability
    • Lack of interactivity
Analysis graph
Example Higgs recoil analysis
calodisplay lar
Calorimeter cells in calodisplay
source: G. Marchiori
Engage with / Revive the efforts of HSF Visualization WG

Conclusions

  • The FCCAnalyses framework is used by the majority of FSR case studies
    • Breaking changes are being gradually introduced
    • Fullsim analyses are coming, more analyzer adjustments needed
    • Other approaches possible:
      podio Python bindings, Coffea, Julia, …
  • Looking forward to EDM4hep 1.0: new campaign of centrally produced samples
    • Utilizing full potential of DIRAC/iLCDirac
  • Visualization tools in need of domain specialized features
  • Weekly meetings happen on Wednesdays at 04:00 PM
    • Informal meeting to debug and discuss analysis related issues

Backup

All the Links

Contacts

  • FCCAnalyses section at FCC Software forum
  • FCC Analysis Mattermost channel
  • FCC-PED SW Analysis mailing list:
    FCC-PED-SoftwareAndComputing-Analysis@cern.ch

Documentation

Key4hep

Coherent set of packages, tools, and standards for different collider concepts

  • Common effort from FCC, CLIC/ILC, EIC, CEPC, Muon Collider, …
    • Preserves and adds onto existing functionality from iLCSoft, FCCSW, CEPCSW, …
    • Builds on top of the experience from LHC experiments and results of targeted R&D (AIDA, …)
    • Many institutes involved: CERN, DESY, IHEP, INFN, IJCLab, …
  • Each project rebases its stack on top of Key4hep
  • Having common building blocks enables synergies across collider communities
  • Main ingredients:
    • Event data model: EDM4hep, based on PODIO, AIDA project
    • Event processing framework: Gaudi, used in LHCb, ATLAS, …
    • Detector description: DD4hep, AIDA project
    • System to build, test and deploy: Spack, suggested by HSF + CVMFS
Key4hep design

EDM4hep

Common language for processing and persistifying data

EDM4hep diagram
  • Specification in a single YAML file
    • Describes standard data structures and relations between them
  • Generated by PODIO (developed as part of AIDA R&D)
  • Challenge: efficiency and thread safety
  • Created by consensus
  • Trade-off between being generic and preserving compactness
  • Moving towards first stable LTS version (v1.0)

PODIO Datasource

Preserving EDM4hep relationships in RDataFrame

  • Collection objects in PODIO/EDM4hep can be accessed on several layers
  • Highest layer provides one-to-many and many-to-many relationships
  • Easy access to the related objects greatly improves writing and understanding of the analyzer
  • Price for this convenience is performance
      Alternative: NTuple creation in Gaudi algorithm

Enabling PODIO Datasource in the analysis:

  • Use analyzers which take as input EDM4hep collections:
    edm4hep::ReconstructedParticleCollection, edm4hep::RecoMCParticleLinkCollection, ...
  • Instruct FCCAnalyses to use podio::DataSource:
    self.use_data_source = True or fccanalysis run --use-data-source ana_script.py
Podio layers

EOS Analysis Space

Various intermediate files of common interest can be stored centrally

FCC-ee space is located at:
/eos/experiment/fcc/ee/analyses_storage/...

in four sub-folders:

  • BSM
  • EW_and_QCD
  • flavor
  • Higgs_and_TOP

Access and quotas:

  • Read access is granted to anyone
  • Write access needs to be granted: Ask your convener :)
  • Total available quota for all four sub-directories is 140 TB
    • Currently 37 TB used
    • Quota is allocated based on actual needs

Analysis registry

Central registry for the FCC case studies

  • Two repositories:
  • Experimentally one can create an analysis package for analysis-specific code
  • Rudimentary and in need of overhaul
    • Based on the post-FSR needs

FCC-ee case studies list. FCC-hh physics performance documentation.

Tests and benchmarks

Ensuring correctness and performance

FCCAnalyses benchmarks.
  • Tests of the FCCAnalyses framework are done daily
    • Test suite is published in FCCTests
    • Goal is to ensure FCCAnalyses is available in the Key4hep nightlies stack
    • Every test runs inside an independent subprocess
    • Tests need access to an LXPlus-like environment
      • Missing HTCondor testing
  • Benchmarks of FCCAnalyses are run after every merge of a PR
    • Benchmarks run on small set of events
    • There is no guaranteed machine to run on
    • Benchmarks of Higgs mass recoil example on LXPlus:
      20k–45k evt/s
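
Running every test in its own subprocess keeps a crash or a leaked state in one test from affecting the others. The idea can be sketched with the Python standard library; the helper `run_isolated` and the two inline commands are hypothetical and do not reproduce the FCCTests implementation.

```python
import subprocess
import sys


def run_isolated(command, timeout=60):
    """Run one test command in a fresh process; return (passed, output)."""
    result = subprocess.run(command, capture_output=True, text=True,
                            timeout=timeout)
    # A test passes when its process exits with return code 0.
    return result.returncode == 0, result.stdout + result.stderr


# Two stand-in test commands: one succeeds, one exits with an error code.
passing = [sys.executable, "-c", "print('ok')"]
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
results = [run_isolated(cmd)[0] for cmd in (passing, failing)]
```

In a real suite the commands would be the framework invocations under test, and the captured output would be kept for the test report.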