FCCAnalyses Developments

Status for May 2023

Juraj Smieško for the FCCSW team

CERN

FCC Software Meeting

30 May 2023

FCCAnalyses Scope

The goal of the framework is to aid the user in obtaining the desired physics results from the reconstructed objects

Framework requirements:

  • Efficiency — Make quick turn-around possible
  • Flexibility — Allow heavy customization
  • Ease of use — Should not be hard to start using
  • Scalable — Handling of large datasets

Key4hep

  • Set of common software packages, tools, and standards for different Detector concepts
  • Common for FCC, CLIC/ILC, CEPC, EIC, …
  • Individual participants can mix and match their stack
  • Main ingredients:
    • Data processing framework: Gaudi
    • Event data model: EDM4hep
    • Detector description: DD4hep
    • Software distribution: Spack

EDM4hep I.

Describes event data with a set of standard objects.

  • Specification in a single YAML file
  • Generated with the help of Podio

EDM4hep II.

Example object:

#-------------  CalorimeterHit
edm4hep::CalorimeterHit:
  Description: "Calorimeter hit"
  Author : "F.Gaede, DESY"
  Members:
    - uint64_t cellID            //detector specific (geometrical) cell id.
    - float energy               //energy of the hit in [GeV].
    - float energyError          //error of the hit energy in [GeV].
    - float time                 //time of the hit in [ns].
    - edm4hep::Vector3f position //position of the hit in world coordinates in [mm].
    - int32_t type               //type of hit. Mapping of integer types to names via collection parameters "CalorimeterHitTypeNames" and "CalorimeterHitTypeValues".
  • Current version: v0.8.0
  • Objects can be extended or new ones created
  • Bi-weekly discussion: Indico
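
To make the member list above concrete, the sketch below mirrors the CalorimeterHit members in a plain Python dataclass. It only illustrates the data layout: Podio generates C++ classes from the YAML specification, and the Vector3f stand-in and the default values used here are assumptions, not generated code.

```python
from dataclasses import dataclass, field


@dataclass
class Vector3f:
    """Stand-in for edm4hep::Vector3f (three floats)."""
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0


@dataclass
class CalorimeterHit:
    """Illustrative mirror of the edm4hep::CalorimeterHit members."""
    cellID: int = 0            # detector specific (geometrical) cell id
    energy: float = 0.0        # energy of the hit in GeV
    energyError: float = 0.0   # error of the hit energy in GeV
    time: float = 0.0          # time of the hit in ns
    position: Vector3f = field(default_factory=Vector3f)  # world coordinates in mm
    type: int = 0              # type of hit


# Members that are not given keep their (assumed) default values.
hit = CalorimeterHit(cellID=42, energy=1.5, position=Vector3f(10.0, 0.0, -3.0))
```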

Datasets

A plethora of processes is pre-generated and available from EOS

ROOT RDataFrame

  • Describes processing of data as actions on table columns
    • Definitions of new columns
    • Filter rules
    • Result definitions (histogram, graph)
  • Actions are evaluated either instantly or lazily
  • Multithreading is available out of the box
  • Optimized for bulk processing
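
The lazy evaluation of actions can be illustrated without ROOT. The toy class below is not the RDataFrame API: LazyTable and its method names are hypothetical and only mimic the pattern of booking column definitions and filters first, then running the event loop once a result is requested.

```python
class LazyTable:
    """Toy table: books define/filter steps and runs them only on demand."""

    def __init__(self, rows, steps=()):
        self._rows = rows            # list of dicts, one per event
        self._steps = list(steps)    # booked steps, not yet executed

    def define(self, name, func):
        # Book a new column computed from the existing ones.
        return LazyTable(self._rows, self._steps + [("define", name, func)])

    def filter(self, func):
        # Book a selection; rows failing it are dropped during the loop.
        return LazyTable(self._rows, self._steps + [("filter", None, func)])

    def _run(self):
        # The "event loop": apply all booked steps row by row.
        for row in self._rows:
            row = dict(row)
            keep = True
            for kind, name, func in self._steps:
                if kind == "define":
                    row[name] = func(row)
                elif not func(row):
                    keep = False
                    break
            if keep:
                yield row

    def take(self, name):
        # Result definition: this call triggers the evaluation.
        return [row[name] for row in self._run()]


events = [{"px": 3.0, "py": 4.0}, {"px": 0.3, "py": 0.4}]
table = (LazyTable(events)
         .define("pt", lambda r: (r["px"] ** 2 + r["py"] ** 2) ** 0.5)
         .filter(lambda r: r["pt"] > 1.0))
pts = table.take("pt")  # nothing was computed before this call
```

Only the first event passes the selection, so pts holds a single value.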

Available libraries

A physics analysis often depends on a multitude of libraries

Libraries integrated into the framework:

  • ROOT — together with RDataFrame
  • ACTS — track reconstruction tools
  • ONNX — neural network exchange format
  • FastJet — jet finding package
  • DD4hep — detector description
  • Delphes — fast simulations

Distribution

FCCAnalyses latest release v0.7.0 can be found:

  • As a package in the stable Key4hep stack
    • Allows a small analysis to be put together quickly
    • Limited options for customization
  • As a tarball/tag from GitHub

Latest/development version of the FCCAnalyses can be found:

  • As a package in the nightlies Key4hep stack
    • Might easily break
    • Latest master
  • By checking out master branch
    • Allows greater customization
    • Requires discipline
    • Hint: Keep your master in sync with upstream (use rebase or merge)
    • Developments are welcome to be merged :)
    • master should always be buildable

Platforms: CentOS 7, AlmaLinux 9, Ubuntu 22.04

Ecosystem

Analysis spread through two repositories:

  • FCCAnalyses
    • Repository of common tools and algorithms
    • General analysis code in analyzers
    • Steering of the analysis (RDataFrame)
    • Access to the datasample (meta)data
    • Running over large datasets / on batch
    • Experimental machinery for case studies
  • FCCeePhysicsPerformance
    • Main place for the abstracts
    • Contains very specific analysis code
      • Or prototypes of tools of common interest to be eventually moved to FCCAnalyses
    • (Proto)package repository

Analysis Architecture I.

One can write and run an analysis in several ways:

  • Managed mode:
    • The RDataFrame is managed by the framework
    • User provides Python analysis script with compulsory attributes
    • Libraries are loaded automatically
    • Dataset metadata are loaded from remote location — CVMFS/HTTP server
    • Batch submission on HTCondor
    • Customization: Possible at the level of analyzer functions
    • Intended for: Quick analysis, no advanced analyzer functions
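
A managed-mode analysis script might look roughly like the sketch below. The attribute names (processList, prodTag, outputDir, nCPUS) and the RDFanalysis class layout are assumptions modelled on the FCCAnalyses example scripts and may differ between versions; the process name, production tag, output directory and the n_reco column are placeholders.

```python
# Hypothetical managed-mode analysis script. Attribute names are assumptions
# modelled on the FCCAnalyses examples; check them against the installed version.

# Processes to run over and the fraction of each sample to use (placeholder).
processList = {
    "p8_ee_ZH_ecm240": {"fraction": 0.1},
}

prodTag = "FCCee/spring2021/IDEA/"  # production tag of the pre-generated samples
outputDir = "outputs/stage1"        # where the output files are written
nCPUS = 4                           # number of CPUs to use


class RDFanalysis:
    """The framework owns the RDataFrame and hands it to analysers()."""

    @staticmethod
    def analysers(df):
        # Book a new column; the string is a C++ expression JITed by ROOT.
        df2 = df.Define("n_reco", "ReconstructedParticles.size()")
        return df2

    @staticmethod
    def output():
        # Columns to be saved in the output tree.
        return ["n_reco"]
```

Such a script would then be handed to the framework with fccanalysis run my_analysis.py.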

Analysis Architecture II.

One can write and run an analysis in several ways:

  • Standalone mode:
    • The RDataFrame is managed by the user
    • Can leverage the FCCAnalyses library of analyzer functions
    • The analysis can be written as a Python script or C++ program
    • Loading of the libraries is handled by the user
    • Dataset metadata have to be handled manually
    • Batch submission is not provided
    • Customization: Creation and steering of the RDataFrame
    • Intended for: Advanced users
  • Ntupleizer style:
    • Intent is to create just flat trees and continue without the framework's help

Writing an analyzer

The library of analyzer functions (analyzers) has been written over the years

  • Analyzers are usually structs which operate on EDM4hep objects
  • Optional dependencies for analyzers can be FastJet, DD4hep, ACTS and ONNX
  • ROOT RDataFrame needs to be aware of the analyzer function
    • Provided as a string
    • Compiled in the library
    • Loaded and JITed by the ROOT.gInterpreter

FCCAnalyses library

  • Vertexing
  • ACTS vertex finder
  • Event variables
  • Calorimeter hit/cluster variables
  • Reconstructed/MC particle operations
  • Flavour tagging
  • Jet clustering/constituents

Workflow

  • The complete analysis in managed mode is divided into three steps (example):
    • analysis_stage1.py, ... — pre-selection stages, analysis dependent, usually runs on batch
    • analysis_final.py — final selection, produces final variables
    • analysis_plots.py — produces plots from histograms/TTrees
  • or into two with the help of Histmaker (example):
    • The pre-selection stages and final stage are combined together
    • Plotting stage

EOS Space

Various intermediate files of common interest can be stored at:
/eos/experiment/fcc/ee/analyses_storage/...

in four subfolders:

  • BSM
  • EW_and_QCD
  • flavor
  • Higgs_and_TOP

Access and quotas:

  • Read access is granted to anyone
  • Write access needs to be granted: Ask your convener :)
  • Total quota for all four directories is 200TB
  • ATM only part of the quota is allocated

Recent changes

Included in v0.7.0:

  • External libraries as addons: PR#194
  • EOS paths accessed through xrootd: PR#202
  • Case studies (proto)packaging: PR#199
  • Inclusion of ONNX + Jet flavour tools: PR#188, PR#224
  • Inclusion of Delphes + Vertexing from Franco Bedeschi: PR#247
  • New sub-commands — build, pin
  • 2D and 3D histograms: PR#253
  • Benchmarking and testing of the example analyses

Available in the master:

  • Improvements in dNdx, time and energy smearing: PR#268
  • Statistical uncertainty and rebin options in the plotter: PR#269
  • Histmaker: PR#277
  • Improved crash reports: PR#276
  • More track utilities: PR#289
  • Code formatting for the analyzers
  • Modularization of the python machinery

Physics Results I.

Decay of an HNL into a muon and two jets

  • BSM/LLP Analysis
  • Private fork with the customizations applied on top
  • Run in managed mode
  • New analyzers, adjustments to the managed mode
  • Uses mix of official and private productions

Physics Results II.

H to invisible

  • Higgs Analysis
  • Could not find the source code
  • Uses officially produced samples

Physics Results III.

Tagger on Z to qq events

  • Example of advanced usage of the framework
  • Uses combination of managed mode and custom python scripts
  • Leverages recently included libraries: Delphes, ONNX
  • Uses officially produced samples

Plans

Reliable framework to aid the physics performance studies for FSR

  • Heads up: Podio frame I/O in Gaudi, PR#100
  • Heads up: Podio collection ids to hashes, PR#412
  • Heads up: RNTuple backend, PR#359
  • Make the framework free from lxplus/HTCondor
  • EDM4hep low level access unwieldy
  • Overhaul "standard library" and disentangle the dependencies
  • Prepare facilities to handle systematics
  • Find distribution channels which allow as wide customization as possible
  • Support fullsim detector studies
  • Support running on the distributed systems (Dirac)

Documentation

There are several sources of documentation

Conclusions & Outlook

  • The combination of EDM4hep and RDataFrame works well
    • Low level access unwieldy
  • Modularization and packaging options under way
  • Started focusing on the full simulation detector studies
    • Access to the detector description through the framework
      • More heads are welcome! (Image: Babyface from Toy Story, Pixar)

Backup

FCCAnalyses vs. Coffea/Coffea-casa

  • Provides a similar set of features to FCCAnalyses
  • Dataframe in coffea, Orchestration in coffea-casa
  • User interface purely pythonic
  • Integrated into python package ecosystem
  • FCCAnalyses is purpose-built for FCC
  • Integration with SWAN and Dask

FCCAnalyses batch submissions

  • FCCAnalyses allows users to submit their jobs onto HTCondor
  • It bootstraps itself with use of scripts in subprocesses
  • Framework creates two files
    • Shell script with fccanalysis command
    • Condor configuration file
  • It is also possible to add user-provided Condor parameters
  • Condor environment now isolated from machine where the submission was done

  • Revised tracking across chunks/stages done with the variable in the ROOT file
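
The two generated files can be pictured with the sketch below. This is not the framework's actual generator: the function name, the file names and the exact set of submit-file entries are assumptions, and the real shell script also has to set up the software environment, which is omitted here.

```python
def make_condor_job(analysis_script, job_name, extra_params=None):
    """Return (shell_script, submit_file) contents as strings."""
    # Shell script that runs the fccanalysis command on the worker node.
    shell_script = "\n".join([
        "#!/bin/bash",
        "set -e",
        f"fccanalysis run {analysis_script}",
        "",
    ])

    # Condor configuration (submit description) file.
    submit_lines = [
        f"executable = {job_name}.sh",
        f"output     = {job_name}.out",
        f"error      = {job_name}.err",
        f"log        = {job_name}.log",
        "getenv     = False",  # do not inherit the submit machine environment
    ]
    # User-provided Condor parameters go before the queue statement.
    for key, value in (extra_params or {}).items():
        submit_lines.append(f"{key} = {value}")
    submit_lines.append("queue 1")
    return shell_script, "\n".join(submit_lines) + "\n"


script, submit = make_condor_job(
    "analysis_stage1.py",
    "stage1_chunk0",
    extra_params={"+JobFlavour": '"longlunch"'},
)
```

Returning the contents as strings keeps the sketch free of file-system side effects; writing them to disk and calling condor_submit would be a separate step.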

Sub-command routing

  • There are three ways to run the analysis
    • fccanalysis run my_analysis.py
    • python config/FCCAnalysesRun.py my_analysis.py
      • Can this way be dropped?
    • python my_analysis.py
  • Removed reliance on try/catch for sub-command routing
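
One way to route sub-commands without exception handling is argparse sub-parsers, sketched below. The sub-command names run, build and pin come from these slides; the parser layout, help strings and handler functions are placeholders and not the actual FCCAnalyses implementation.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="fccanalysis")
    subparsers = parser.add_subparsers(dest="command", required=True)

    run = subparsers.add_parser("run", help="run an analysis script")
    run.add_argument("script")
    run.set_defaults(handler=lambda args: f"running {args.script}")

    build = subparsers.add_parser("build", help="build the local checkout")
    build.set_defaults(handler=lambda args: "building")

    pin = subparsers.add_parser("pin", help="pin the Key4hep stack version")
    pin.set_defaults(handler=lambda args: "pinning")
    return parser


def main(argv):
    args = build_parser().parse_args(argv)
    # Dispatch through the handler attached to the chosen sub-command.
    return args.handler(args)


result = main(["run", "my_analysis.py"])
```

Each sub-command carries its own handler, so adding a new one does not touch the dispatch code.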

Code formatting

  • Currently, there is a wide range of styles used
  • End goal: Make the analyzers better organized
    • They are building blocks of the analysis
  • Created CI to check every commit

  • LLVM Style selected based on popularity
  • Only changed lines are checked
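
Checking only the changed lines can be sketched as follows. The CI setup itself is not described in these slides; the sketch assumes a unified diff produced without context lines (for example with git diff -U0) and turns each hunk into a --lines=start:end argument for clang-format. The file name in the sample diff is hypothetical.

```python
import re

# Hunk header of a unified diff: @@ -old_start[,old_count] +new_start[,new_count] @@
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def changed_line_ranges(diff_text):
    """Return (first, last) line ranges added or modified in the new file."""
    ranges = []
    for line in diff_text.splitlines():
        match = HUNK_RE.match(line)
        if not match:
            continue
        start = int(match.group(1))
        # A missing count means the hunk covers exactly one line.
        count = int(match.group(2)) if match.group(2) is not None else 1
        if count > 0:  # count == 0 means the hunk only deletes lines
            ranges.append((start, start + count - 1))
    return ranges


diff = """\
--- a/analyzers/Example.cc
+++ b/analyzers/Example.cc
@@ -10 +10,2 @@
-int x=1;
+int x = 1;
+int y = 2;
@@ -40,3 +41,0 @@
-// removed
-// removed
-// removed
"""
ranges = changed_line_ranges(diff)
lines_args = [f"--lines={first}:{last}" for first, last in ranges]
```

The pure-deletion hunk yields no range, since there is nothing left to format in the new file.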

Updated vertexing

  • Vertexing done with the help of code from Franco B.
  • Introduces dependency on Delphes
  • Introduces new analyzers: SmearedTracksdNdx, SmearedTracksTOF
  • Simplifies Delphes–EDM4hep unit gymnastics
  • Adds examples for Bs to Ds K

Building of FCCAnalyses

  • FCCAnalyses is a package in the Key4hep stack
  • Advanced users can work directly on their forks
    • Allows keeping the analysis "cutting edge"
    • Requires discipline
  • Added helper sub-command: fccanalysis build

  • Current distribution mechanisms:
    • Using released version in Key4hep stack
    • Separate git repository + stable Key4hep stack
    • Separate git repository + nightlies stack

Key4hep stack pin

  • FCCAnalyses is developed on top of Key4hep stack
  • Sometimes depends on specific version of the package
  • Added helper sub-command: fccanalysis pin

  • Will pin the analysis to a specific version of the Key4hep stack
    • There is no patch mechanism in the Key4hep stack