Usage of EDM4hep datamodel in RDataFrame

Juraj Smieško (CERN)

ROOT Users Workshop 2025

CERN

19 November 2025

Key4hep

Coherent set of packages, tools, and standards for future colliders

  • Common effort from FCC, CLIC/ILC, EIC, CEPC, Muon Collider, …
    • Preserves and adds onto existing functionality from iLCSoft, FCCSW, CEPCSW, …
    • Builds on top of the experience from LHC experiments and results of targeted R&D (AIDA, …)
    • Many institutes involved: CERN, DESY, IHEP, INFN, IJCLab, …
  • Each project rebases its stack on top of Key4hep
  • Having common building blocks enables synergies across collider communities
  • Main ingredients:
    • Event data model: EDM4hep, based on PODIO, AIDA project
    • Event processing framework: Gaudi, used in LHCb, ATLAS, …
    • Detector description: DD4hep, AIDA project
    • System to build, test and deploy: Spack, suggested by HSF + CVMFS
Key4hep design

FCC Analysis

  • FCCAnalyses — common analysis framework for FCC:
    • ROOT RDataFrame based
    • Provides analyzer functions/functors of varying complexity
      • An analysis is composed of these
    • Written in C++ and Python
    • Handles input dataset metadata
    • Manages running of the dataframe: locally or on HTCondor
    • Event data can be read directly by RDataFrame or through an RDataSource
  • Accompanying infrastructure of FCCAnalyses
  • Getting help
Analysis graph
Example Higgs recoil analysis

EDM4hep I.

Common language for processing and persistifying data

EDM4hep diagram
  • Specification in a single YAML file
    • Describes standard data structures and relations between them
  • Generated by PODIO (developed as part of AIDA R&D)
  • Challenge: efficiency and thread safety
  • Created by consensus
  • Trade-off between generality and compactness
  • Moving towards first stable LTS version (v1.0)

EDM4hep II.

Example object definition:

            
              
            
          
  • Current version: v0.99.04
  • New objects can be added using the extension mechanism
  • Bi-weekly discussion: Indico

PODIO

Generates Event Data Model and serves as I/O Layer

  • Generates C++ classes from a YAML file using Jinja2 templating
  • Employs plain-old-data (POD) data structures
  • I/O machinery consists of three layers
    • POD Layer - actual data structures
    • Object Layer - helps resolve the relations
    • User Layer - user facing layer with fully fledged objects
  • Supports multiple backends:
    • ROOT TTree, ROOT RNTuple, SIO
  • Current version: 1.06
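
The three layers can be pictured with a small, self-contained Python sketch. It only illustrates the idea and is not PODIO's generated code: the names `ClusterData`, `ClusterObj`, `Cluster` and `build_clusters` are hypothetical, and plain Python lists stand in for PODIO collections.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterData:
    """POD layer: plain values plus an index range into a flat index list."""
    energy: float
    hits_begin: int
    hits_end: int


@dataclass
class ClusterObj:
    """Object layer: holds the POD together with the resolved relations."""
    data: ClusterData
    hits: list = field(default_factory=list)


class Cluster:
    """User layer: a lightweight handle that exposes getters only."""

    def __init__(self, obj):
        self._obj = obj

    def energy(self):
        return self._obj.data.energy

    def hits(self):
        return list(self._obj.hits)


def build_clusters(pods, hit_indices, hits):
    """Resolve each POD's index range into hit objects (hypothetical helper)."""
    clusters = []
    for pod in pods:
        obj = ClusterObj(pod)
        obj.hits = [hits[i] for i in hit_indices[pod.hits_begin:pod.hits_end]]
        clusters.append(Cluster(obj))
    return clusters


# Two clusters sharing one flat list of hit indices.
hits = ["hitA", "hitB", "hitC"]
pods = [ClusterData(1.5, 0, 2), ClusterData(0.7, 2, 3)]
clusters = build_clusters(pods, [0, 2, 1], hits)
```

The point of the split is that only the POD layer needs to be written to disk, while the user never sees the index ranges.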

PODIO Reader

Constructs the EDM4hep objects for the user

Example usage of PODIO Reader in Python:

                  
                    
                  
                
Podio Frame
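
A reader loop of the kind described on this slide might look as follows. This is a minimal sketch, assuming the `podio.root_io.Reader` Python binding and the usual `events` category; the function name `count_objects_per_event` and any file or collection names passed to it are hypothetical.

```python
def count_objects_per_event(path, collection, category="events"):
    """Return the number of objects in `collection` for every frame."""
    # Imported lazily so that this module loads without a Key4hep environment.
    from podio.root_io import Reader

    reader = Reader(path)
    counts = []
    for frame in reader.get(category):
        # frame.get() hands back the collection built by the reader.
        counts.append(sum(1 for _ in frame.get(collection)))
    return counts
```

In a Key4hep environment this could be called as `count_objects_per_event("some_file.root", "MCParticles")`, where both arguments are placeholders.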

ROOT RDataFrame + PODIO

Employing the RDataSource API

  • RDataSource provides EDM4hep (PODIO) collections to RDataFrame event by event
  • Collections are constructed by PODIO readers
  • RDataSource decides how to organize reading of the events
    • To be optimized
  • Multi-threaded: One PODIO Reader per thread
  • All PODIO backends supported out of the box (TTree, RNTuple, SIO)
  • EDM4hep schema evolution kicks in
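
One simple way to organize the reading with one reader per thread is to give each slot a contiguous range of events. The sketch below shows such a split in plain Python; it is an illustration under that assumption, not the strategy actually used by the data source, and `split_event_ranges` is a hypothetical name.

```python
def split_event_ranges(n_events, n_slots):
    """Split [0, n_events) into at most n_slots contiguous (begin, end) ranges."""
    if n_slots < 1:
        raise ValueError("n_slots must be at least 1")
    base, extra = divmod(n_events, n_slots)
    ranges = []
    start = 0
    for slot in range(n_slots):
        # The first `extra` slots take one additional event each.
        size = base + (1 if slot < extra else 0)
        if size == 0:
            continue  # more slots than events: leave this slot idle
        ranges.append((start, start + size))
        start += size
    return ranges
```

Each range would then be served by its own reader, so no reader is shared between threads.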

Reading EDM4hep in FCCAnalyses

  • EDM4hep collection is read in by RDataFrame directly and presented to the user in the form:
                    
                      
                    
                  
    • This is per event
    • No convenient access to relationships
  • Example of a simple function:
                    
                      
                    
                  
  • In the course of the analysis, the EDM4hep objects gradually decay into flatter objects, i.e. PODs and POD vectors
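
The flattening can be illustrated with a self-contained Python stand-in. It assumes that each event exposes a vector of POD structs; `Vector3f`, `RecoParticleData` and `get_pt` are hypothetical names used here for illustration and are not the FCCAnalyses functions themselves.

```python
import math
from dataclasses import dataclass


@dataclass
class Vector3f:
    x: float
    y: float
    z: float


@dataclass
class RecoParticleData:
    """Stand-in for a POD struct holding one reconstructed particle."""
    energy: float
    momentum: Vector3f


def get_pt(particles):
    """Reduce a per-event vector of PODs to a flat vector of transverse momenta."""
    return [math.hypot(p.momentum.x, p.momentum.y) for p in particles]
```

After such a step the analysis only carries a list of floats, and any connection to the original objects is lost.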

EDM4hep Relations with RVecs

  • One collection can contain one-to-one or one-to-many relations to other collections, e.g.:
    • RecDqdx → Track
    • CaloHit → CaloHitContribution
    • MCParticle → MCParticle
  • Relations typically connect objects on the same side (the simulation side is kept separate from the reconstruction side)
  • Handling the indices is cumbersome
  • Example analyzer (FCC Tutorials link):
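
To show why the index handling is cumbersome, here is a minimal Python sketch of a one-to-many relation resolved by hand. It assumes that each particle stores a `[daughters_begin, daughters_end)` range into a separate flat list of daughter indices; `MCParticleData` is a simplified stand-in and `get_daughters` is a hypothetical helper, not the analyzer from the tutorial.

```python
from dataclasses import dataclass


@dataclass
class MCParticleData:
    """Simplified stand-in for the stored particle POD."""
    pdg: int
    daughters_begin: int
    daughters_end: int


def get_daughters(index, particles, daughter_indices):
    """Return the daughters of particles[index] by following the index range."""
    parent = particles[index]
    selected = daughter_indices[parent.daughters_begin:parent.daughters_end]
    return [particles[i] for i in selected]


# A Z boson decaying into two muons; the muons have empty daughter ranges.
particles = [
    MCParticleData(23, 0, 2),
    MCParticleData(13, 2, 2),
    MCParticleData(-13, 2, 2),
]
daughter_indices = [1, 2]
```

Every analyzer that follows a relation has to be handed both the collection and the matching index list, and has to pair them correctly.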
                    
                      
                    
                  

EDM4hep Relations Using DataSource

  • One collection can contain one-to-one or one-to-many relations to other collections, e.g.:
    • RecDqdx → Track
    • CaloHit → CaloHitContribution
    • MCParticle → MCParticle
  • Relations typically connect objects on the same side (the simulation side is kept separate from the reconstruction side)
  • Possible rewrite:
                    
                      
                    
                  

Links

  • One-to-one relationships between two collection types, e.g.:
    • MCParticle ↔ ReconstructedParticle
    • SimTrackerHit ↔ TrackerHit
  • Relationships between Simulation and Reconstruction side
  • They behave as collections; there can be more than one link per object
  • Example analyzer: Link/Association between RecoParticle and MCParticle (link):
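
The bookkeeping behind such a link lookup can be sketched in plain Python. The sketch assumes that a link collection is stored as three parallel lists (index on the "from" side, index on the "to" side, and a weight) and that the "from" side is the reconstructed particle; `linked_mc_indices` is a hypothetical helper and not the analyzer referenced above.

```python
def linked_mc_indices(reco_index, link_from, link_to, link_weights):
    """Return (mc_index, weight) for every link starting at reco_index."""
    matches = []
    for frm, to, weight in zip(link_from, link_to, link_weights):
        if frm == reco_index:
            matches.append((to, weight))
    return matches


# Reco particle 0 is linked to two MC particles, reco particle 1 to one.
link_from = [0, 0, 1]
link_to = [5, 7, 2]
link_weights = [0.9, 0.1, 1.0]
```

Because one object can appear in several links, the lookup returns a list, and the caller decides how to use the weights.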
                    
                      
                    
                  

Links in DataSource

  • One-to-one relationships between two collection types, e.g.:
    • MCParticle ↔ ReconstructedParticle
    • SimTrackerHit ↔ TrackerHit
  • Relationships between Simulation and Reconstruction side provided out of the box
  • Users can define their own
  • Possible rewrite:
                    
                      
                    
                  
  • Allows using EDM4hep/PODIO tools to resolve all links to an object:
                    
                      
                    
                  

Benchmark I.

Example Higgs analysis, threads: 1

Benchmark II.

Single hist, C++, threads: 1
  • Create single histogram of EFlowPhoton cluster energy
  • PODIO Source, Dummy Source, and plain PODIO request all collections of the event

Benchmark III.

One branch, C++, threads: 1
  • Create single histogram of EFlowPhoton cluster energy
  • Making sure only one branch is read

Conclusions

  • Focus of EDM4hep is on Simulation & Reconstruction
  • Let's try to avoid reimplementing EDM4hep objects/collections in the RDF Analysis
    • Current index shuffling is tedious and error-prone when using RDataFrame directly
    • Changes to EDM4hep/PODIO require adaptation of analysis code
  • With podio::DataSource:
    • Analyzer can work with fully fledged EDM4hep objects
    • Schema evolution handled by PODIO itself
    • Performance can get close to direct RDataFrame
  • Continuing with podio::DataSource optimizations:
    • Properly distribute the event ranges
    • Activate only the collections needed
Thanks to: V. Padulano, S. Hageboeck, and T. Madlener