podio::(ROOT)DataSource

Juraj Smieško (CERN)

FCC Software Meeting

CERN, 06 May 2024

Key4hep

  • Set of common software packages, tools, and standards for different detector concepts
  • Common for FCC, CLIC/ILC, CEPC, EIC, …
  • Individual participants can mix and match their stack
  • Main ingredients:
    • Data processing framework: Gaudi
    • Event data model: EDM4hep
    • Detector description: DD4hep
    • Software distribution: Spack
Figures: HEP Stack, Key4hep design (source: Frank Gaede)

EDM4hep I.

Describes event data with a set of standard objects.

  • Specification in a single YAML file
  • Generated with the help of Podio
EDM4hep diagram

EDM4hep II.

Example object:

              
                
              
            
  • Current version: v0.10.5
  • Objects can be extended or new ones created
  • Bi-weekly discussion: Indico

EDM4hep 1.0

EDM4hep will reach version 1.0 soon; breaking changes and fixes are being introduced.

Some of the changes/fixes underway:

              
                
              
            

New release of FCCAnalyses 0.9: preserves the state before the EDM4hep 1.0 changes

  • Will arrive in stable Key4hep stack soon

Podio

Generates the event data model (EDM) and serves as the I/O layer

  • Generates EDM from YAML files
  • Employs plain-old-data (POD) data structures
  • I/O machinery consists of three layers
    • POD Layer - actual data structures
    • Object Layer - helps resolve the relations
    • User Layer - fully fledged EDM objects
  • Supports multiple backends:
    • ROOT, SIO, ...
  • Current version: 0.99
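
The three layers can be pictured with a small toy model. This is only a sketch in plain Python: the class and function names (`HitData`, `HitObj`, `Hit`, `build_hits`) are hypothetical and are not the Podio API, which generates C++ classes from the YAML definition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParticleData:
    """POD layer: plain data only."""
    pdg: int


@dataclass
class HitData:
    """POD layer: the relation is stored as a bare index (-1 = none)."""
    energy: float
    particle_index: int = -1


class HitObj:
    """Object layer: pairs the POD with its resolved relation."""

    def __init__(self, data: HitData, particle: Optional[ParticleData]):
        self.data = data
        self.particle = particle


class Hit:
    """User layer: thin handle with a convenient interface."""

    def __init__(self, obj: HitObj):
        self._obj = obj

    def energy(self) -> float:
        return self._obj.data.energy

    def particle_pdg(self) -> Optional[int]:
        particle = self._obj.particle
        return None if particle is None else particle.pdg


def build_hits(hits: List[HitData], particles: List[ParticleData]) -> List[Hit]:
    """Resolve the indices once, so user code never touches them."""
    objs = []
    for data in hits:
        related = particles[data.particle_index] if data.particle_index >= 0 else None
        objs.append(HitObj(data, related))
    return [Hit(obj) for obj in objs]


particles = [ParticleData(pdg=11), ParticleData(pdg=22)]
hits = build_hits([HitData(1.5, 1), HitData(0.2)], particles)
pdgs = [hit.particle_pdg() for hit in hits]
print(pdgs)
```

The point of the split is that only the POD layer has to be written to and read from file, while the relation lookup lives in one place instead of in every analysis function.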

Podio Reader

Constructs the EDM4hep objects for the user

Example usage of Podio Reader in Python:

                    
                      
                    
                  

ROOT RDataFrame

ROOT RDataFrame Illustration
  • Describes processing of data as actions on table columns
    • Definitions of new columns
    • Filter rules
    • Result definitions (histogram, graph)
  • The actions are lazily evaluated
  • Multi-threading is available out of the box
  • Optimized for bulk processing
  • Allows integration of existing C++ libraries
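
Lazy evaluation of column actions can be illustrated without ROOT. The sketch below is a toy stand-in written for this purpose: `LazyFrame` and its methods are hypothetical and only mimic the idea that definitions and filters are recorded first and run when a result is requested.

```python
from typing import Callable, Dict, Iterable, List


class LazyFrame:
    """Toy table whose actions are recorded and executed on demand."""

    def __init__(self, rows: Iterable[Dict[str, float]], steps=()):
        self._rows = rows
        self._steps = tuple(steps)  # recorded actions, not yet executed

    def define(self, name: str, func: Callable[[Dict[str, float]], float]) -> "LazyFrame":
        return LazyFrame(self._rows, self._steps + (("define", name, func),))

    def filter(self, pred: Callable[[Dict[str, float]], bool]) -> "LazyFrame":
        return LazyFrame(self._rows, self._steps + (("filter", None, pred),))

    def collect(self, column: str) -> List[float]:
        """Run the event loop; only here are the recorded steps executed."""
        out = []
        for row in self._rows:
            row = dict(row)  # work on a copy, leave the input untouched
            keep = True
            for kind, name, func in self._steps:
                if kind == "define":
                    row[name] = func(row)
                elif not func(row):
                    keep = False
                    break
            if keep:
                out.append(row[column])
        return out


calls = []


def pt(row):
    calls.append(1)  # count how often the column is actually computed
    return (row["px"] ** 2 + row["py"] ** 2) ** 0.5


events = [{"px": 3.0, "py": 4.0}, {"px": 0.3, "py": 0.4}]
frame = LazyFrame(events).define("pt", pt).filter(lambda r: r["pt"] > 1.0)
calls_before = len(calls)  # nothing has been computed yet
selected = frame.collect("pt")
print(calls_before, selected)
```

Booking all actions before running lets a single pass over the data serve every requested result, which is what makes the model suitable for bulk processing.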

ROOT RDataSource + Podio

RDataSource defines an API that RDataFrame can use to read arbitrary data formats.

  • RDataSource provides EDM4hep (Podio) collections to RDataFrame event by event
  • Collections are constructed by Podio readers
  • RDataSource can decide how to organize reading of the events
    • Currently not optimized at all
  • Multi-threaded implementation
  • Might easily support other Podio backends (SIO, ...)
  • Schema evolution support
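
One way a data source can organize reading is to hand out contiguous entry ranges, one per processing slot. The sketch below assumes an even split, which is chosen here for illustration only and is not taken from the Podio implementation; `entry_ranges` is a hypothetical helper.

```python
from typing import List, Tuple


def entry_ranges(n_entries: int, n_slots: int) -> List[Tuple[int, int]]:
    """Split [0, n_entries) into at most n_slots contiguous half-open ranges."""
    if n_slots < 1:
        raise ValueError("need at least one slot")
    base, extra = divmod(n_entries, n_slots)
    ranges, start = [], 0
    for slot in range(n_slots):
        # The first `extra` slots take one additional entry each.
        size = base + (1 if slot < extra else 0)
        if size == 0:
            continue  # fewer entries than slots: skip empty ranges
        ranges.append((start, start + size))
        start += size
    return ranges


ranges = entry_ranges(10, 4)
print(ranges)
```

With the benchmark settings quoted in this talk (300k events, 4 threads), such a split would give each slot 75 000 events.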

Reading EDM4hep in FCCAnalyses

  • EDM4hep collection is read in by RDataFrame directly and presented to the user in the form:
                      
                        
                      
                    
    • This is per event
    • No convenient access to relationships
  • Example of a simple function:
                      
                        
                      
                    
  • Over the course of the analysis, the EDM4hep objects gradually decay into more trivial ones
EDM4hep diagram

Relations

  • One collection can contain one-to-one or one-to-many relations to other collections, e.g.:
    • CaloHit → CaloHitContribution
    • MCParticle → MCParticle
  • Typically relationships between derived objects (Sim. side separated from Reco. side)
  • Example analyzer (FCC Tutorials link):
                      
                        
                      
                    
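The index bookkeeping behind a one-to-many relation can be sketched as follows. The layout is a simplified assumption: each particle stores a begin/end range into a separate flat list of indices, and the names `MCParticleData`, `daughter_index` and `daughter_pdgs` are illustrative rather than copied from FCCAnalyses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MCParticleData:
    """Flat record: the relation is a range into a separate index list."""
    pdg: int
    daughters_begin: int
    daughters_end: int


def daughter_pdgs(idx: int, particles: List[MCParticleData],
                  daughter_index: List[int]) -> List[int]:
    """Follow the index range by hand to reach the daughter particles."""
    parent = particles[idx]
    indices = daughter_index[parent.daughters_begin:parent.daughters_end]
    return [particles[i].pdg for i in indices]


particles = [
    MCParticleData(pdg=23, daughters_begin=0, daughters_end=2),   # Z
    MCParticleData(pdg=13, daughters_begin=2, daughters_end=2),   # mu-
    MCParticleData(pdg=-13, daughters_begin=2, daughters_end=2),  # mu+
]
daughter_index = [1, 2]  # flat list shared by all particles

z_daughters = daughter_pdgs(0, particles, daughter_index)
print(z_daughters)
```

Every analysis function has to receive both the particle collection and the matching index list, and nothing prevents passing the wrong one.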

Relations in DataSource

  • One collection can contain one-to-one or one-to-many relations to other collections, e.g.:
    • CaloHit → CaloHitContribution
    • MCParticle → MCParticle
  • Typically relationships between derived objects (Sim. side separated from Reco. side)
  • Possible rewrite:
                      
                        
                      
                    

Associations

  • One-to-one relationships between two collection types, e.g.:
    • MCParticle ↔ ReconstructedParticle
    • SimTrackerHit ↔ TrackerHit
  • Relationships between Simulation and Reconstruction side
  • Example analyzer: Association between RecoParticle and MCParticle (link):
                      
                        
                      
                    
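A one-to-one association handled through raw indices can be sketched like this. It assumes the association arrives as two parallel index lists, one pointing into the reconstructed collection and one into the MC collection; `reco_to_mc` and the sample values are hypothetical.

```python
from typing import Dict, List


def reco_to_mc(reco_index: List[int], mc_index: List[int]) -> Dict[int, int]:
    """Map each reco particle index to its associated MC particle index."""
    if len(reco_index) != len(mc_index):
        raise ValueError("association index lists must have equal length")
    return dict(zip(reco_index, mc_index))


reco_energy = [45.1, 44.8]  # toy reconstructed particles
mc_pdg = [23, 13, -13]      # toy MC particles: Z, mu-, mu+

links = reco_to_mc(reco_index=[0, 1], mc_index=[1, 2])
matched = [(reco_energy[r], mc_pdg[m]) for r, m in sorted(links.items())]
print(matched)
```

The analyzer has to keep four inputs consistent (two collections and two index lists) to answer a single question about one pair of objects.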

Associations in DataSource

  • One-to-one relationships between two collection types, e.g.:
    • MCParticle ↔ ReconstructedParticle
    • SimTrackerHit ↔ TrackerHit
  • Relationships between Simulation and Reconstruction side
  • Possible rewrite:
                      
                        
                      
                    

Small sample benchmarks

  • Three scenarios:
    • Simple C++ analysis
    • C++ analysis with associations
    • Python analysis: analysis_stage1.py example
  • 300k events
  • Local storage
  • 4 threads
  • DataSource takes approx. 1.6–3.6× the memory and 2–4× the execution time of the current implementation, depending on the scenario
  • More tests needed:
    • More complex (real) analysis
    • Running on a sample read over the network

Simple C++ analysis

DataSource:

DataSource in simple C++ analysis, threads: 4
  • Max RAM: 2.6 GiB
  • Run time: 23 s

Current implementation:

simple C++ analysis, threads: 4
  • Max RAM: 1.6 GiB
  • Run time: 8 s

C++ analysis with Associations

DataSource:

DataSource in C++ analysis with associations, threads: 4
  • Max RAM: 2.3 GiB
  • Run time: 45 s

Current implementation:

C++ analysis with associations, threads: 4
  • Max RAM: 1.45 GiB
  • Run time: 23 s

FCCAnalyses stage 1 (Python)

DataSource:

DataSource in stage1 analysis, threads: 4
  • Max RAM: 3.4 GiB
  • Run time: 75 s

Current implementation:

stage1 analysis, threads: 4
  • Max RAM: 0.95 GiB
  • Run time: 19 s
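
The overhead per scenario follows directly from the numbers quoted on the three benchmark slides. The short calculation below only divides the DataSource figures by those of the current implementation; no additional measurements are involved.

```python
# (max RAM in GiB, run time in s), copied from the benchmark slides
benchmarks = {
    "simple C++": {"datasource": (2.6, 23), "current": (1.6, 8)},
    "C++ with associations": {"datasource": (2.3, 45), "current": (1.45, 23)},
    "stage1 (Python)": {"datasource": (3.4, 75), "current": (0.95, 19)},
}

ratios = {}
for name, result in benchmarks.items():
    ram_ds, time_ds = result["datasource"]
    ram_cur, time_cur = result["current"]
    # Ratio of DataSource to current implementation, one decimal place
    ratios[name] = (round(ram_ds / ram_cur, 1), round(time_ds / time_cur, 1))
    print(f"{name}: RAM x{ratios[name][0]}, run time x{ratios[name][1]}")
```

The Python stage 1 scenario shows the largest overhead in both memory and run time, while the C++ analysis with associations is closest to a factor of two in run time.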

Documentation

Multiple sources of documentation

Conclusions

  • Primary focus of EDM4hep is on reconstruction: dense event format
  • Current index management is tedious and error-prone
  • Changes to EDM4hep/Podio require regenerating samples
  • With podio::(ROOT)DataSource:
    • Analyzers work with fully fledged EDM4hep objects
    • Schema evolution handled by Podio reader
    • Layer between ROOT file and RDataFrame
      • Costs in the small benchmarks: approx. 1.6–3.6× RAM and 2–4× run time
  • Podio PR: #593, FCCAnalyses PR: #309

Backup

Example analysis

The Higgs boson mass and σ(ZH) from the recoil mass with leptonic Z decays (link)