FCC Dataset Management: Current Status

Juraj Smieško (CERN)

Discussion of the FCC Dataset Management System

08 September 2025

FCCAnalyses Overview

Analysis framework inside the Key4hep ecosystem, built
on top of ROOT RDataFrame

FCCAnalyses' place in the event reconstruction chain.
  • Provides standard library of functions
    • Many C++ HEP frameworks available
  • Automatic retrieval of metadata for the centrally produced datasets
    • Including the location and selection of the input files
  • Runs the ROOT RDataFrame
    • Local or remote execution
  • Helps with histograms/plots
    • Export of results to other tools
  • Registry for the analyses
    • Dedicated place for all case-studies
  • Minimal support for the Analysis facilities

Other analysis approaches

Python and Julia based analysis approaches

  • podio Python bindings
  • Coffea — awkward-arrays based columnar analysis framework
  • Julia: high-level, general-purpose dynamic programming language
  • C++: Raw podio/EDM4hep, RDataFrame with analyzers from FCCAnalyses
  • Mojo on the horizon

Centrally produced datasets

  • Datasets include generator level files, parametrized simulation and Fullsim samples for FCC-ee and FCC-hh
  • Stored on EOS at:
    /eos/experiment/fcc/<accelerator-type>/generation/
    /eos/experiment/fcc/prod/fcc/<accelerator-type>/
  • Detailed information about the samples published on FCC Physics Events website
    • Storage backend now uses a PostgreSQL database instead of plain JSON files
    • Search possible across all datasets
    • Dynamic metadata fields
    • 2025 Summer Student Project of Marko Cechovic
  • FCCAnalyses framework automatically picks up the sample information from YAML and JSON files
  • Old FCC-hh Fullsim and Delphes samples moved to tape
    • Sample metadata kept on the website for historical record

EOS Space

Main EOS nodes:
/eos/experiment/fcc/
/eos/experiment/fcc/prod/

Analysis/User EOS nodes:
/eos/experiment/fcc/ee/analyses_storage/
/eos/experiment/fcc/users/

FCC-ee analysis space has 4 sub-directories:

  • BSM
  • EW_and_QCD
  • flavor
  • Higgs_and_TOP

Quotas:

  • Total available quota for analyses_storage is 140 TB
  • Quota for users not defined

FCC Physics Events website: current usage and quotas for the FCC EOS nodes.
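
The analysis storage layout and quota can be captured in a few lines for bookkeeping scripts. This is a minimal sketch under two assumptions not stated in the source: that the four sub-directories sit directly under `analyses_storage`, and that the 140 TB quota is shared by all of them. The function names are hypothetical, and the usage figures must be supplied by the caller.

```python
# Values taken from the lists above.
ANALYSES_STORAGE = "/eos/experiment/fcc/ee/analyses_storage"
SUBDIRECTORIES = ("BSM", "EW_and_QCD", "flavor", "Higgs_and_TOP")
TOTAL_QUOTA_TB = 140.0


def group_path(group: str) -> str:
    """Return the storage path of one analysis group.

    Assumes the sub-directories live directly under ANALYSES_STORAGE.
    """
    if group not in SUBDIRECTORIES:
        raise ValueError(f"unknown analysis group: {group!r}")
    return f"{ANALYSES_STORAGE}/{group}"


def remaining_quota_tb(used_tb: dict[str, float]) -> float:
    """Return the remaining quota in TB, given per-group usage in TB.

    Assumes the quota is shared across all groups; groups missing from
    `used_tb` are counted as using nothing.
    """
    unknown = set(used_tb) - set(SUBDIRECTORIES)
    if unknown:
        raise ValueError(f"unknown analysis groups: {sorted(unknown)}")
    return TOTAL_QUOTA_TB - sum(used_tb.values())


if __name__ == "__main__":
    # Placeholder usage figures for illustration only, not real measurements.
    example_usage = {"BSM": 10.0, "flavor": 20.0}
    print(group_path("BSM"))
    print(f"remaining: {remaining_quota_tb(example_usage)} TB")
```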

EventProducer

Delphes samples
  • Homegrown solution — series of Python scripts to submit, validate and publish a dataset
  • Employing HTCondor for the computationally heavy generation
  • Supports only a limited number of simulation workflows:
    • Generate LHE files from gridpacks
    • Generate LHE files directly from MG5
    • Generate STDHEP files directly from Whizard + Pythia6
    • Generate EDM4hep files from the LHE and decay with Pythia8
    • Generate EDM4hep files from STDHEP
    • Generate EDM4hep files from Pythia8
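
The supported workflows listed above form a small graph from a generator input to EDM4hep output. The sketch below encodes them as a lookup table and chains the steps. It is an illustration only and not EventProducer code: the node labels and the `route` function are hypothetical shorthand for the list entries.

```python
# Each pair is (input, output) of one workflow from the list above.
WORKFLOWS = [
    ("gridpack", "LHE"),
    ("MG5", "LHE"),
    ("Whizard+Pythia6", "STDHEP"),
    ("LHE", "EDM4hep"),      # decay with Pythia8
    ("STDHEP", "EDM4hep"),
    ("Pythia8", "EDM4hep"),
]


def route(start: str, goal: str = "EDM4hep") -> list[str]:
    """Return the chain of formats leading from `start` to `goal`.

    Every input in WORKFLOWS has exactly one output, so following the
    first matching step is enough for this table.
    """
    path = [start]
    while path[-1] != goal:
        next_steps = [dst for src, dst in WORKFLOWS if src == path[-1]]
        if not next_steps:
            raise ValueError(f"no supported workflow starting from {path[-1]!r}")
        path.append(next_steps[0])
    return path


if __name__ == "__main__":
    # Show the chain for every starting point that is not itself an
    # intermediate format.
    outputs = {dst for _, dst in WORKFLOWS}
    for src, _ in WORKFLOWS:
        if src not in outputs:
            print(" -> ".join(route(src)))
```

A table like this makes the "limited number of workflows" explicit: any input without an entry fails with a clear error.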

DIRAC / iLCDirac

Interware to exploit distributed heterogeneous resources

DIRAC Overview
  • Created by LHCb, now also used by Belle II, ILC, CTAO, CEPC, …
  • High level interface to interact with Grids, Clouds, HPCs and Batch systems
  • Data / Metadata management
    • File transfers / replication, metadata augmented file catalog
  • Web interface to control aspects of the system
    • Transformations, jobs, accounting, system administration
  • Resources organized into virtual organizations (VOs)
  • iLCDirac: set of extensions / applications developed for the linear collider studies

DIRAC / iLCDirac @ FCC

Resources available in the FCC Virtual Organization

  • FCC resources administered by FCC VO
    • Could be used by individual users and for centralized dataset preparation
    • Support e-group: fcc-vo-support@cern.ch
  • Storage Elements attached to FCC VO:
    • GLASGOW-DISK: Operational
    • CNAF-DISK: Operational
    • BARI-DISK: In progress
  • Started to exercise the system to produce a test campaign
    • Full campaign will be launched after EDM4hep 1.0 is finalized
    • Configuration kept in FCCDIRAC repository
  • Software needs to be wrapped into a DIRAC application
    • Many available thanks to iLCDirac
    • More added to support users/productions of FCC
      • More MC generators, interfaces for Key4hep based applications

DIRAC Applications

DIRAC DMS

Federico Stagni at 4th Rucio Community Workshop

FCC DMS Expectations

FCC DMS

Backup

Key4hep

Coherent set of packages, tools, and standards for different collider concepts

  • Common effort from FCC, CLIC/ILC, EIC, CEPC, Muon Collider, …
    • Preserves and adds onto existing functionality from iLCSoft, FCCSW, CEPCSW, …
    • Builds on top of the experience from LHC experiments and results of targeted R&D (AIDA, …)
    • Many institutes involved: CERN, DESY, IHEP, INFN, IJCLab, …
  • Each project rebases its stack on top of Key4hep
  • Having common building blocks enables synergies across collider communities
  • Main ingredients:
    • Event data model: EDM4hep, based on PODIO, AIDA project
    • Event processing framework: Gaudi, used in LHCb, ATLAS, …
    • Detector description: DD4hep, AIDA project
    • System to build, test and deploy: Spack (suggested by HSF) + CVMFS

Key4hep design

EDM4hep

Common language for processing and persistifying data

EDM4hep diagram
  • Specification in a single YAML file
    • Describes standard data structures and relations between them
  • Generated by PODIO (developed as part of AIDA R&D)
  • Challenge: efficiency and thread safety
  • Created by consensus
  • Trade-off between being generic and preserving compactness
  • Moving towards first stable LTS version (v1.0)

Analysis registry

Central registry for the FCC case studies

  • Two repositories:
  • Experimentally, one can create an analysis package for analysis-specific code
  • Rudimentary and in need of an overhaul
    • Based on the post-FSR needs

FCC-ee case studies list. FCC-hh physics performance documentation.