
ROOT I/O: The Fast and Furious



Presentation Transcript


  1. ROOT I/O: The Fast and Furious
  CHEP 2010: Taipei, October 19. Philippe Canal/FNAL, Brian Bockelman/Nebraska, René Brun/CERN

  2. Overview
  Several enhancements to ROOT I/O performance:
  • Prefetching (a.k.a. TTreeCache)
  • Clustering the baskets
  • I/O challenges in CMS
  • Optimizing the streaming engine
  ROOT I/O: The Fast and Furious. October 2010

  3. ROOT I/O Landscape
  [Diagram: objects in memory <-> unzipped buffers <-> zipped buffers <-> network / remote disk file / local disk file]

  4. ROOT I/O – Branches And Baskets
  [Diagram: tree entries flow through the streamer into per-branch baskets, which are written to the tree in the file]

  5. Without Prefetching
  [Diagram: baskets are read one at a time, scattered across the file]

  6. Solution: TTreeCache
     T->SetCacheSize(cachesize);
     if (cachesize != 0) {
        T->SetCacheEntryRange(efirst,elast);
        T->AddBranchToCache(data_branch,kTRUE); // Request all the sub-branches too
        T->AddBranchToCache(cut_branch,kFALSE);
        T->StopCacheLearningPhase();
     }
  • Prefetches and caches a set of baskets (from several branches)
  • Designed to reduce the number of file reads (or network messages) when reading a TTree, by several orders of magnitude
  • Configuration:
     • Size of the reserved memory area
     • Set of branches to be read, or range of entries used to learn them
     • Range of entries to read

  7. TTreePerfStats
     void taodr(Int_t cachesize=10000000) {
        gSystem->Load("aod/aod"); // shared lib generated with TFile::MakeProject
        TFile *f = TFile::Open("AOD.067184.big.pool.root");
        TTree *T = (TTree*)f->Get("CollectionTree");
        Long64_t nentries = T->GetEntries();
        T->SetCacheSize(cachesize);
        T->AddBranchToCache("*",kTRUE);
        TTreePerfStats ps("ioperf",T);
        for (Long64_t i=0;i<nentries;i++) {
           T->GetEntry(i);
        }
        T->PrintCacheStats();
        ps.SaveAs("aodperf.root");
     }
     ******TreeCache statistics for file: AOD.067184.big.pool_3.root ******
     Number of branches in the cache ...: 9705
     Cache Efficiency ..................: 0.973247
     Cache Efficiency Rel...............: 1.000000
     Learn entries......................: 100
     Reading............................: 1265967732 bytes in 115472 transactions
     Readahead..........................: 0 bytes with overhead = 0 bytes
     Average transaction................: 10.963417 Kbytes
     Number of blocks in current cache..: 3111, total size: 2782963
     Root > TFile f("aodperf.root")
     Root > ioperf.Draw()

  8. With TTreeCache
  Gain of a factor 6.5! Old real time = 722 s, new real time = 111 s.
  The limitation is now CPU time.

  9. Better, But…
  • Sequential reading still requires some backward file seeks
  • Still a few interleaved reads

  10. ROOT I/O – Default Basket Sizes
  [Diagram: with the default basket sizes, the baskets of different branches cover different entry ranges and are interleaved in the file]

  11. OptimizeBaskets
  Improve the basket sizes:
  • The default basket size was the same for all branches, and tuning the sizes by hand is very time consuming.
  • TTree::OptimizeBaskets resizes the baskets to even out the number of entries in the baskets of each branch and to reduce the total memory use.

  12. ROOT I/O – Split/Cluster Branches
  [Diagram: after clustering, the baskets of all branches for a given entry range are written contiguously in the file]

  13. Clustering (AutoFlush)
  Enforce “clustering”:
  • Once a reasonable amount of data (default is 30 MBytes) has been written to the file, all baskets are flushed out and the number of entries written so far is recorded in fAutoFlush.
  • From then on, all the baskets are flushed at every multiple of this number of entries.
  • This ensures that the range of entries between 2 flushes can be read in one single file read.
  • The first time that FlushBaskets is called, we also call OptimizeBaskets.
  • The TTreeCache is always set to read a number of entries that is a multiple of fAutoFlush entries.
  No backward seeks are needed to read the file: dramatic improvement in the raw disk I/O speed.

  14. ROOT I/O – Split/Cluster Branches
  [Diagram: the clustered layout again – the baskets of all branches for each entry range stored contiguously]

  15. OptimizeBaskets, AutoFlush
  Solution, enabled by default:
  • Automatically tweak basket sizes!
  • Flush baskets at regular intervals!

  16. CMS IO Changes
  18 months ago: averaged 3.88 reads, 138 KB of data per event, and 35 KB per read (for cosmics reconstruction with CMSSW_3_0_0).
  Since then:
  • Changed split level to zero.
  • Fixed TTreeCache usage in CMSSW. There was one cache per file, and using multiple TTrees was resetting the cache every time we switched between TTrees.
  • Improved read order.
  Now averages 0.18 reads, 108 KB of data per event, and 600 KB per read (for event data reconstruction with a CMSSW_3_9_x pre-release). This underestimates the improvements, as we are comparing ordered cosmics with unordered pp data.

  17. CMS IO Challenges
  One of the biggest issues tackled was read order:
  • Files come out of CERN with the ROOT TTree ordering not equal to the CMS event-number ordering. Thus they are read out of order; the worst-performing jobs read 20x too much data from skipping around in the tree.
  • In CMSSW_3_8_0, we now read each run/lumi in TTree order.
  • Huge performance boost on real data.
  • If the runs are contiguous, the whole file is read in order.
  • But this adds new constraints on how we merge files.
  TTreeCache performance:
  • Great for ntuple reading and reco-like jobs.
  • Needs to be improved for skims:
     • What if the training period was not representative of the use case?
     • We have little knowledge of when a skim will select a full event.

  18. CMS – Collaborating with ROOT Team
  Collaboration is very good because there is mutual interest and experts are available.
  Lack of a CMS IO standard candle:
  • Makes it difficult for CMS to understand how changes in the file layout / workflow patterns affect IO performance.
  • Hinders CMS’s ability to communicate our needs to the ROOT team.
  ROOT IO debugging is challenging:
  • Tedious to backtrack curious IO patterns to the code causing them.
  • It takes a CMSSW and ROOT IO expert to understand what’s going on and communicate what needs to be fixed.
  • It took 2 years for CMS to notice, investigate, and solve (with ROOT’s help) why TTreeCache didn’t work in CMSSW.
  • Statistics tell you when things are bad, but it takes an expert to figure out why or how to fix it.

  19. Memberwise Streaming
  [Diagram: object-wise layout (x1 y1 z1 x2 y2 z2 x3 y3 z3) vs member-wise layout (x1 x2 x3 y1 y2 y3 z1 z2 z3)]
  • Used for split collections inside a TTree
  • Now the default for streaming collections even when not split
  • Better compression, faster read time
  Results for CMS files: some fully split, some unsplit.

  20. Optimization Of Inner Loops
  A ‘generic’ switch statement was used, via a large template function, for several cases:
  • Single object, collection of pointers, collection of objects, caching mechanism for schema evolution, split collection of pointers.
  • Improves code localization and reduces code duplication.
  Drawbacks:
  • Many initializations done ‘just in case’, both intentionally and unintentionally. In at least one case the compiler optimization re-ordered the code, resulting in more memory fetches being done than necessary.
  • Many if statements to ‘tweak’ the implementation at run-time.
  • To stay generic, the code could only dereference the address using operator[], hence relying on function overloads.
  • For the collection proxy, to reduce code duplication, operator[] was very generic. This prevented efficient use of ‘next’ in the implementation of looping.
  • Code generality required loops in all cases, even to loop just once.

  21. Optimization Of Inner Loops
  Possible solution: go further in using template functions by customizing the implementation of the switch cases depending on the ‘major’ case.
  Disadvantages:
  • Still could not optimize for a specific collection (for example vector) because they are ‘hidden’ behind the TVirtualCollectionProxy.
  • Cannot go toward a complete template solution because it would not support the ‘emulated’ case.
  • The large switch statement still prevents some/most compilers from properly optimizing the code.

  22. Optimization Of Inner Loops
  Solution: replace the switch statement by a customized function call.
  Advantages:
  • Can add new implementations more easily.
  • Can customize the action for each of the specific cases:
     • No inner loop for a single object.
     • Loop with known increment for a vector of pointers and TClonesArray.
     • Loop with simple increment (vector and all emulated collections).
     • Loop using the actual iterator for compiled collections.
  • Removes any if statement that can be resolved just by looking at the on-file and in-memory class layouts.
  • Able to also strip out some of the functions (calls).
  • The outer loop is simpler and can now be overloaded in the various TBuffer implementations, removing code needed only in special cases (XML and SQL).
  Disadvantages:
  • Increases code duplication (but our code has been very stable).

  23. Examples
  Before:
     void TClass::Streamer(void *object, TBuffer &b) const {
        // Stream object of this class to or from buffer.
        switch (fStreamerType) {
           case kExternal:
           case kExternal|kEmulated: { ...; return; }
           case kTObject: { ...; return; }
           // Etc.
  After:
     inline void Streamer(void *obj, TBuffer &b) const {
        // Inline for performance, skipping one function call.
        (this->*fStreamerImpl)(obj,b,onfile_class);
     }
     void TClass::StreamerDefault(void *object, TBuffer &b) const {
        // Default streaming in cases where Property() has not yet been called.
        Property(); // Sets fStreamerImpl
        (this->*fStreamerImpl)(object,b);
     }

  24. Main Focuses
  • Performance: both CPU and I/O (and more improvements to come)
  • Consolidation: Coverity, Valgrind, ROOT forum, Savannah
  • Support

  25. Backup Slides

  26. Reading network files
     f = TFile::Open("http://root.cern.ch/files/AOD.067184.big.pool_4.root")
     f = TFile::Open("http://root.cern.ch/files/atlasFlushed.root")
  TR = Transactions; NT = Network Time (latency + data transfer)
