
Computing for Belle



  1. Computing for Belle CHEP2004 September 27, 2004 Nobu Katayama KEK

  2. Outline • Belle in general • Software • Computing • Production • Networking and Collaboration issues • Super KEKB/Belle/Belle computing upgrades Nobu Katayama(KEK)

  3. Belle detector Nobu Katayama(KEK)

  4. Integrated luminosity • May 1999: first collision • July 2001: 30 fb-1 • Oct. 2002: 100 fb-1 • July 2003: 150 fb-1 • July 2004: 287 fb-1 • July 2005: 550 fb-1 (projected) • Super KEKB: 1~10 ab-1/year! • ~1 fb-1/day! >1TB raw data/day; more than 20% is hadronic events Nobu Katayama(KEK)

  5. Continuous Injection • No need to stop the run: always at ~maximum currents and therefore maximum luminosity [CERN Courier Jan/Feb 2004] • ~30% more ∫Ldt; both KEKB & PEP-II now use continuous injection • ~1 fb-1/day! (~1×10^6 BBbar pairs) [Plot: HER/LER currents and luminosity over 24 hours, continuous injection (new) vs. normal injection (old)] Nobu Katayama(KEK)

  6. Summary of b → sqq CPV: "sin2φ1" • Something new in the loop?? 2.4σ • We need a lot more data (A: consistent with 0) Nobu Katayama(KEK)

  7. The Belle Collaboration Seoul National U. Shinshu U. Sungkyunkwan U. U. of Sydney Tata Institute Toho U. Tohoku U. Tohoku Gakuin U. U. of Tokyo Tokyo Inst. of Tech. Tokyo Metropolitan U. Tokyo U. of A and T. Toyama Nat’l College U. of Tsukuba Utkal U. VPI Yonsei U. Aomori U. BINP Chiba U. Chonnam Nat’l U. Chuo U. U. of Cincinnati Ewha Womans U. Frankfurt U. Gyeongsang Nat’l U. U. of Hawaii Hiroshima Tech. IHEP, Beijing IHEP, Moscow IHEP, Vienna ITEP Kanagawa U. KEK Korea U. Krakow Inst. of Nucl. Phys. Kyoto U. Kyungpook Nat’l U. U. of Lausanne Jozef Stefan Inst. U. of Melbourne Nagoya U. Nara Women’s U. National Central U. Nat’l Kaoshiung Normal U. Nat’l Lien-Ho Inst. of Tech. Nat’l Taiwan U. Nihon Dental College Niigata U. Osaka U. Osaka City U. Panjab U. Peking U. Princeton U. Riken-BNL Saga U. USTC 13 countries, ~57 institutes, ~400 members Nobu Katayama(KEK)

  8. Collaborating institutions • Collaborators • Major labs/universities from Russia, China, India • Major universities from Japan, Korea, Taiwan, Australia… • Universities from the US and Europe • KEK dominates in one sense • 30~40 staff members work on Belle exclusively • Most of the construction and operating costs are paid by KEK • Universities dominate in another sense • Young students stay at KEK, help operations, do physics analysis • Human resource issue • Always short of manpower Nobu Katayama(KEK)

  9. Software

  10. Core Software • OS/C++ • Solaris 7 on sparc and RedHat 6/7/9 on PCs • gcc 2.95.3/3.0.4/3.2.2/3.3 (code also compiles with SunCC) • No commercial software except for the batch queuing system and the hierarchical storage management system • QQ, EvtGen, GEANT3, CERNLIB (2001/2003), CLHEP (~1.5), postgres 7 • Legacy FORTRAN code (GSIM/GEANT3 and old calibration/reconstruction code) • I/O: home-grown stream IO package + zlib; a minimal sketch follows • The only data format for all stages (from DAQ to final user analysis skim files) • Index files (pointers to events in data files) are used for final physics analysis Nobu Katayama(KEK)
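
For illustration only, here is a minimal sketch of what a zlib-compressed, record-oriented event stream plus an event index amounts to. The record layout and function name are invented (this is not the Belle I/O package); only the zlib calls (compressBound, compress2) are the real library API.

```cpp
// Illustrative sketch only: compress one event buffer with zlib and record
// its file offset so an index file can point back to the event later.
// The record layout and function name are invented; only the zlib calls
// (compressBound, compress2) are the real library API.
#include <zlib.h>
#include <cstdint>
#include <cstdio>
#include <vector>

bool write_event(std::FILE* data, std::FILE* index,
                 const std::vector<unsigned char>& event)
{
  uLongf zlen = compressBound(event.size());
  std::vector<unsigned char> zbuf(zlen);
  if (compress2(zbuf.data(), &zlen, event.data(), event.size(),
                Z_DEFAULT_COMPRESSION) != Z_OK)
    return false;

  long offset = std::ftell(data);                 // where this record starts
  std::uint32_t sizes[2] = { static_cast<std::uint32_t>(zlen),
                             static_cast<std::uint32_t>(event.size()) };
  std::fwrite(sizes, sizeof(sizes), 1, data);     // record header
  std::fwrite(zbuf.data(), 1, zlen, data);        // compressed payload

  std::fwrite(&offset, sizeof(offset), 1, index); // one index entry per event
  return true;
}
```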

  11. Framework (BASF) • Event parallelism on SMP (1995~) using fork() (to cope with legacy Fortran common blocks); see the sketch below • Event parallelism on multiple compute servers (dbasf, 2001~; V2, 2004) • Users' code and reconstruction code are dynamically loaded • The only framework for all processing stages (from DAQ to final analysis) Nobu Katayama(KEK)
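
A hedged sketch of the fork-based idea: each worker inherits a copy of process memory (including any Fortran common blocks) at fork time and processes its share of events independently. This is illustrative only, with a simple round-robin event split; it is not the BASF scheduler.

```cpp
// Minimal sketch of fork()-based event parallelism on an SMP machine.
// Each child inherits its own copy of process memory (including legacy
// Fortran COMMON blocks), so modules need not be thread-safe.
// Illustrative only; not the actual BASF scheduler.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

static void process_event(int ev)
{
  (void)ev;  // user/reconstruction modules would run here
}

int main()
{
  const int nworkers = 4;
  const int nevents  = 1000;

  for (int w = 0; w < nworkers; ++w) {
    pid_t pid = fork();
    if (pid == 0) {                        // child: take every nworkers-th event
      for (int ev = w; ev < nevents; ev += nworkers)
        process_event(ev);
      _exit(0);
    }
  }
  while (wait(nullptr) > 0) {}             // parent: wait for all workers
  std::puts("all events processed");
  return 0;
}
```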

  12. Data access methods • The original design of our framework, BASF, allowed an appropriate IO package to be loaded at run time • Our (single) IO package grew to handle more and more ways of doing IO (disk, tape, etc.), was extended to deal with special situations, and became spaghetti-ball-like code • This summer we were faced with extending it yet again, to handle the new HSM and tests of new software • We finally rewrote our big IO package into small pieces, making it possible to simply add one derived class per IO method • Basf_io base class with derived classes Basf_io_disk, Basf_io_tape, Basf_io_zfserv, Basf_io_srb, … • IO objects are dynamically loaded upon request; a minimal sketch of the pattern follows Nobu Katayama(KEK)
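
The pattern described here is essentially an abstract I/O interface plus one dynamically loaded concrete class per access method. A minimal sketch follows; the class, factory and shared-library names are illustrative, not the real Basf_io interface.

```cpp
// Sketch of a pluggable IO layer: one abstract base class, one derived
// class per access method, loaded on demand with dlopen()/dlsym().
// Class, factory and library names are illustrative, not the real Basf_io.
#include <dlfcn.h>
#include <cstdio>
#include <string>

class BasfIo {                        // interface shared by all IO methods
public:
  virtual ~BasfIo() {}
  virtual bool open(const std::string& name) = 0;
  virtual int  read_event(void* buf, int maxlen) = 0;
  virtual void close() = 0;
};

// Each shared library (e.g. libbasf_io_disk.so, libbasf_io_srb.so) would
// export a factory function with this signature.
typedef BasfIo* (*IoFactory)();

BasfIo* load_io(const std::string& method)
{
  std::string lib = "libbasf_io_" + method + ".so";   // hypothetical naming
  void* handle = dlopen(lib.c_str(), RTLD_NOW);
  if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return nullptr; }
  IoFactory make = (IoFactory)dlsym(handle, "create_io");
  return make ? make() : nullptr;
}

int main()
{
  BasfIo* io = load_io("disk");        // would load libbasf_io_disk.so
  if (!io) return 1;
  io->open("run123.mdst");
  io->close();
  delete io;
  return 0;
}
```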

  13. Reconstruction software • 30~40 people have contributed over the last several years • For many parts of the reconstruction software we have only one package; very little competition • Good and bad: identify weak points and ask someone to improve them • Mostly organized within the sub-detector groups (physics motivated, though) • Systematic effort to improve the tracking software, but very slow progress • For example, one year to bring the tracking systematic error down from 2% to less than 1% • Small Z bias for either forward/backward or positive/negative charged tracks • When the problem is solved we will reprocess all data again Nobu Katayama(KEK)

  14. Analysis software • Several to tens of people have contributed • Kinematic and vertex fitter • Flavor tagging • Vertexing • Particle ID (likelihood based; see the sketch below) • Event shape • Likelihood/Fisher analysis • People tend to use the standard packages but… • The system is not well organized/documented • We have started a task force (consisting of young Belle members) Nobu Katayama(KEK)
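
As an illustration of the likelihood-style particle ID (not the actual Belle package), per-detector likelihoods are multiplied and turned into a normalized ratio that is used as the selection variable. The struct and function names below, and the toy numbers, are invented for the example.

```cpp
// Illustrative sketch of likelihood-based K/pi identification: multiply
// per-detector likelihoods and form a normalized ratio.  Not the actual
// Belle PID code; the inputs are assumed to come from dE/dx, time-of-flight
// and aerogel Cherenkov measurements.
#include <cstdio>
#include <vector>

struct PidLikelihoods {
  double kaon;   // L_K from one sub-detector
  double pion;   // L_pi from the same sub-detector
};

// Returns a value in [0,1]; close to 1 means kaon-like.
double kaon_id(const std::vector<PidLikelihoods>& detectors)
{
  double lk = 1.0, lpi = 1.0;
  for (const PidLikelihoods& d : detectors) { lk *= d.kaon; lpi *= d.pion; }
  return (lk + lpi > 0.0) ? lk / (lk + lpi) : 0.5;
}

int main()
{
  std::vector<PidLikelihoods> ll = { {0.8, 0.1}, {0.6, 0.3} };  // toy numbers
  std::printf("P(K) = %.3f\n", kaon_id(ll));  // a cut such as > 0.6 selects kaons
  return 0;
}
```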

  15. PostgreSQL database system • The only database system Belle uses, other than simple UNIX files and directories • A few years ago we were afraid that nobody was using PostgreSQL, but it now seems to be widely used and well maintained • One master and several copies at KEK, many copies at institutions/on personal PCs • ~120,000 records (4.3GB on disk) • The IP (interaction point) profile is the largest/most popular; a minimal access sketch follows • It is working quite well, although consistency among the many database copies is a problem Nobu Katayama(KEK)
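
A minimal libpq sketch of how a client might read a record such as the IP profile. The connection string, table and column names are hypothetical, not Belle's actual schema; only the libpq calls themselves are the real API.

```cpp
// Minimal libpq example: connect to the (replicated) PostgreSQL server
// and read one constants record.  Table/column names are hypothetical.
#include <libpq-fe.h>
#include <cstdio>

int main()
{
  PGconn* conn = PQconnectdb("host=dbserver dbname=belle user=reader");
  if (PQstatus(conn) != CONNECTION_OK) {
    std::fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
    PQfinish(conn);
    return 1;
  }
  PGresult* res = PQexec(conn,
      "SELECT ip_x, ip_y, ip_z FROM ip_profile WHERE exp=31 AND run=500");
  if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0)
    std::printf("IP = (%s, %s, %s)\n",
                PQgetvalue(res, 0, 0), PQgetvalue(res, 0, 1),
                PQgetvalue(res, 0, 2));
  PQclear(res);
  PQfinish(conn);
  return 0;
}
```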

  16. BelleG4 for (Super) Belle • We have (finally) started building a Geant4 version of the Belle detector simulation program • Our plan is to build the Super Belle detector simulation code first, then write reconstruction code • We hope to increase the number of people who can write Geant4 code in Belle, so that within a year or so we can have a Belle G4 simulation and compare G4, G3 and real data • Some of the detector code is in F77 and we must rewrite it • A generic Geant4 detector-construction skeleton is sketched below Nobu Katayama(KEK)
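
For readers new to Geant4, the detector geometry is supplied through a user class like the skeleton below. This is a generic example with an invented class name, not the Belle or Super Belle geometry.

```cpp
// Generic Geant4 detector-construction skeleton (not the Belle geometry;
// the class name is invented).  Every Geant4 application supplies a class
// like this that builds the world volume and places components inside it.
#include "G4VUserDetectorConstruction.hh"
#include "G4Box.hh"
#include "G4LogicalVolume.hh"
#include "G4PVPlacement.hh"
#include "G4NistManager.hh"
#include "G4SystemOfUnits.hh"

class BelleG4Detector : public G4VUserDetectorConstruction {
public:
  G4VPhysicalVolume* Construct() override {
    G4NistManager* nist = G4NistManager::Instance();
    G4Material* air = nist->FindOrBuildMaterial("G4_AIR");

    // World volume; the real sub-detectors (SVD, CDC, ECL, ...) would be
    // built and placed inside it.
    G4Box* worldBox = new G4Box("World", 5.*m, 5.*m, 5.*m);
    G4LogicalVolume* worldLog = new G4LogicalVolume(worldBox, air, "World");
    return new G4PVPlacement(nullptr, G4ThreeVector(), worldLog,
                             "World", nullptr, false, 0);
  }
};
```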

  17. Computing

  18. Computing equipment budgets • Rental system • Four → five year contract (20% budget reduction!) • 1997-2000 (2.5B yen, ~18M Euro, for 4 years) • 2001-2005 (2.5B yen, ~18M Euro, for 5 years) • A new acquisition process is starting for 2006/1- • Belle purchased systems • KEK Belle operating budget: 2M Euro/year • Of the 2M Euro, 0.4~1M Euro/year goes to computing • Tapes (0.2M Euro), PCs (0.4M Euro) etc. • Sometimes we get a bonus(!): so far, about 2M Euro in total over five years • Other institutions • 0~0.3M Euro/year/institution • On average, very little money allocated Nobu Katayama(KEK)

  19. Rental system (2001-2005): total cost in five years (M Euro) Nobu Katayama(KEK)

  20. Belle's reference platform • Solaris 2.7; everyone has an account • 9 workgroup servers (500MHz, 4 CPUs) • 38 compute servers (500MHz, 4 CPUs) • LSF batch system • 40 tape drives (2 each on 20 servers) with fast access to disk servers • 20 user workstations with DAT, DLT, AIT drives • Maintained by Fujitsu SEs/CEs under the big rental contract • Compute servers (@KEK, Linux RH 6.2/7.2/9) • User terminals (@KEK, to log onto the group servers): 120 PCs (~50 Win2000 + X window software, ~70 Linux) • User analysis PCs (@KEK, unmanaged) • Compute/file servers at universities: a few to a few hundred at each institution, used for generic MC production as well as physics analyses at each institution (e.g. the tau analysis center at Nagoya U.); Sparc and Intel CPUs Nobu Katayama(KEK)

  21. PC farm of several generations [Chart: total farm capacity (GHz) and integrated luminosity vs. year, 1999-2004; earlier totals shown: 168GHz, 320GHz] • Dell 36 PCs (Pentium-III ~0.5GHz) • Compaq 60 PCs (Pentium-III 0.7GHz) • Fujitsu 127 PCs (Pentium-III 1.26GHz) • Appro 113 PCs (Athlon 1.67GHz×2): 377GHz • NEC 84 PCs (Xeon 2.8GHz×2): 470GHz • Fujitsu 120 PCs (Xeon 3.2GHz×2): 768GHz • Dell 150 PCs (Xeon 3.4GHz×2): 1020GHz, coming soon! Nobu Katayama(KEK)

  22. Disk servers@KEK • 8TB NFS file servers • Mounted on all UltraSparc servers via GbE • 4.5TB staging disk for HSM • Mounted on all UltraSparc/Intel PC servers • ~15TB local data disks on PCs • generic MC files are stored there and used remotely • Inexpensive IDE RAID disk servers (capacity arithmetic worked out below) • 160GB × (7+1) × 16 = 18TB @ 100K Euro (12/2002) • 250GB × (7+1) × 16 = 28TB @ 110K Euro (3/2003) • 320GB × (6+1+S) × 32 = 56TB @ 200K Euro (11/2003) • 400GB × (6+1+S) × 40 = 96TB @ 250K Euro (11/2004) • With a tape library in the back end Nobu Katayama(KEK)
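
As a worked example of the usable-capacity arithmetic (assuming the (7+1) notation means seven data disks plus one parity disk per RAID array, and S a hot spare), the 250GB purchase gives:

```latex
\underbrace{250\,\mathrm{GB}}_{\text{per disk}}\times
\underbrace{7}_{\text{data disks per array}}\times
\underbrace{16}_{\text{arrays}}
= 28\,000\,\mathrm{GB}\approx 28\,\mathrm{TB}
```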

  23. Tape libraries • Direct-access DTF2 tape library • 40 drives (24MB/s each) on 20 UltraSparc servers • 4 drives for data taking (one is used at a time) • The tape library can hold 2500 tapes (500TB) • We store raw data and DST using • the Belle tape IO package (allocate, mount, unmount, free), • a perl script (registers files in the tape database), and • LSF (for exclusive use of tape drives) • a hypothetical sketch of this sequence follows • HSM backend with three DTF2 tape libraries • Three 40TB tape libraries form the back end of 4.5TB of disk • Files are written/read as if they were all on disk • When the library becomes full, we move tapes out of it and human operators insert tapes when requested by users by e-mail Nobu Katayama(KEK)
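
The direct-access workflow amounts to allocate/mount, write, unmount/free, with the drive reserved through LSF and the file list recorded in a database. The sketch below only illustrates that sequence; every function here is an invented stub, not the real Belle tape IO API.

```cpp
// Hypothetical sketch of the direct tape-access sequence described above.
// These are illustrative stubs, not the real Belle tape IO API; they just
// show the allocate -> mount -> write -> unmount -> free workflow.
#include <cstdio>
#include <string>
#include <vector>

static int  allocate_drive()              { std::puts("reserve drive via LSF"); return 0; }
static void mount_tape(int, const std::string& v) { std::printf("mount %s\n", v.c_str()); }
static void write_files(int, const std::vector<std::string>& f)
                                          { std::printf("write %zu files\n", f.size()); }
static void unmount_tape(int)             { std::puts("unmount"); }
static void free_drive(int)               { std::puts("release drive"); }
static void register_in_db(const std::string& v, const std::vector<std::string>&)
                                          { std::printf("record %s in tape DB\n", v.c_str()); }

int main()
{
  std::vector<std::string> files = { "run123.raw", "run124.raw" };
  int drive = allocate_drive();        // exclusive drive use arbitrated by LSF
  mount_tape(drive, "DTF2-0421");      // hypothetical volume name
  write_files(drive, files);           // raw data / DST records
  unmount_tape(drive);
  free_drive(drive);
  register_in_db("DTF2-0421", files);  // done by a perl script in practice
  return 0;
}
```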

  24. Data size so far • Raw data • 400TB written since Jan. 2001 for 230 fb-1 of data, on 2000 tapes • DST data • 700TB written since Jan. 2001 for 230 fb-1 of data, on 4000 tapes, compressed with zlib • MDST data (four-vectors, vertices and PID) • 15TB for 287 fb-1 of hadronic events (BBbar and continuum), compressed with zlib • τ, two-photon: add 9TB for 287 fb-1 • Total: ~1PB on DTF(2) tapes plus 200+TB on SAIT [Chart: DTF(2) tape usage, number of tapes per year; DTF (40GB) tapes in 1999-2000, DTF2 (200GB) tapes in 2001-2004] Nobu Katayama(KEK)

  25. Mass storage strategy • Development of next-generation DTF drives was canceled • SONY's new mass storage system will use SAIT drive technology (metal tape, helical scan) • We decided to test it • Purchased a 500TB tape library and installed it as the backend (HSM), using newly acquired inexpensive IDE-based RAID systems and PC file servers • We are moving from direct tape access to a hierarchical storage system • We have learned that automatic file migration is quite convenient • But we need a lot of capacity so that we do not need operators to mount tapes Nobu Katayama(KEK)

  26. Compact, inexpensive HSM • The front-end disk system consists of • 8 dual Xeon PC servers with two SCSI channels each • Each channel connecting one 16 × 320GB IDE disk RAID system • Total capacity is 56TB (1.75TB per (6+1+S) array × 2 × 2 × 8) • The back-end tape system consists of • a SONY SAIT PetaSite tape library system in three-rack-wide space: the main system (four drives) plus two cassette consoles, with a total capacity of 500TB (1000 tapes) • They are connected by a 16-port 2Gbit FC switch • Installed Nov. 2003~Feb. 2004 and working well • We keep all new data on this system • Lots of disk failures (even power failures), but the system is surviving • Please see my poster Nobu Katayama(KEK)

  27. PetaServe: HSM software (commercial software from SONY) • We have been using it since 1999 on Solaris; it now runs on Linux (RH7.3 based) • It uses the Data Management API of XFS developed by SGI • When the drive and the file server are connected via the FC switch, data on disk can be written directly to tape • The minimum size for migration, and the size of the initial part of each file that remains on disk, can be set by users • The backup system, PetaBack, works intelligently with PetaServe; in the next version, a file that has been shadowed (one copy on tape, one copy on disk) will not be backed up again, saving space and time • So far it is working extremely well, but: • Files must be distributed by ourselves among disk partitions (now 32, in three months many more, as there is a 2TB limit…) • The disk is a file system and not a cache, so the HSM disk cannot serve as an effective scratch disk • There is a mechanism for nightly garbage collection if users delete files by themselves Nobu Katayama(KEK)

  28. Extending the HSM system • The front-end disk system will include • ten more dual Xeon PC servers with two SCSI channels each • Each channel connecting one 16 × 400GB RAID system • Total capacity is 150TB (1.7TB per (6+1+S) array × 2 × 2 × 8 + 2.4TB per (6+1+S) array × 2 × 2 × 10) • The back-end tape system will include • four more cassette consoles and eight more drives; total capacity 1.29PB (2500 tapes) • A 32-port 2Gbit FC switch will be added (32+16 interconnected with 4 ports) • Belle hopes to migrate all DTF2 data into this library by the end of 2005 • 20TB/day of data must be moved from the old tape library to the new one! Nobu Katayama(KEK)

  29. Floor space required [Floor plan comparing the DTF2 library systems (Lib ~500TB: ~141㎡, BHSM ~150TB: ~31㎡, Backup ~13TB: ~16㎡) with the new, more compact SAIT library systems] Nobu Katayama(KEK)

  30. Use of LSF • We have been using LSF since 1999 • We started using LSF on PCs in 2003/3 • Of ~1400 CPUs (as of 2004/11), ~1000 CPUs are under LSF, used for DST production, generic MC generation, calibration, signal MC generation and user physics analyses • DST production uses its own distributed computing (dbasf) and the child nodes do not run under LSF • All other jobs share the CPUs and are dispatched by LSF • For users we use fair-share scheduling • We will evaluate new features in LSF 6.0, such as reporting and multi-cluster support, hoping to use it as a collaboration-wide job management tool Nobu Katayama(KEK)

  31. Rental system plan for Belle • We have started the lengthy process of computer acquisition for 2006 (-2010) • 100,000 specCINT2000_rates (or whatever…) of compute servers • 10PB storage system with extensions possible • Network connections fast enough to read/write data at 2-10GB/s (2 for DST, 10 for physics analysis) • A user-friendly and efficient batch system that can be used collaboration-wide • We hope to make it a Grid-capable system • Interoperate with systems at remote institutions involved in LHC grid computing • Careful balancing of lease and yearly purchases Nobu Katayama(KEK)

  32. Production

  33. Online production and KEKB • As we take data, we write raw data directly to tape (DTF2) and at the same time run the DST production code (event reconstruction) on 85 dual Athlon 2000+ PC servers • See Itoh-san's talk on RFARM on the 29th in the Online Computing session • Using the results of event reconstruction we send feedback to the KEKB accelerator group, such as the location and size of the collision point, so that they can tune the beams to maximize instantaneous luminosity while keeping them colliding at the center • We also monitor the BBbar cross section very precisely, so the machine beam energies can be changed by ~1MeV to maximize the number of BBbar pairs produced • The resulting DST is written to temporary disks and skimmed for detector calibration • Once detector calibration is finished, we run the production again and make DST/mini-DST for final physics analysis [Plots: luminosity and vertex Z position; "RF phase" changed, corresponding to 0.25 mm] Nobu Katayama(KEK)

  34. Online DST production system [Diagram: the Belle event builder in the E-hut feeds the computer center over 1000base-SX; input distributor and output collector nodes, 85 worker nodes (170 CPUs; dual Athlon 1.6GHz, dual PIII 1.3GHz and dual Xeon 2.8GHz machines), a Dell CX600 disk server, DTF2 tape, DNS/NIS and control servers, 3Com 4400 and Planex FMG-226SX switches, 100base-TX control network; server and control rooms; all PCs run Linux] Nobu Katayama(KEK)

  35. Online data processing • Run 39-53 (Sept. 18, 2004): 211 pb-1 accumulated; accepted 9,029,058 events (accept rate 360.30Hz, run time 25,060s) • Level 3 software trigger: recorded 4,710,231 events (52%), 186.6GB used • Peak luminosity was 9×10^33; we have been running since Sept. 9 • RFARM: 82 dual Athlon 2000+ servers; total CPU time 1,812,887s • L4 output: 3,980,994 events, 893,416 hadrons; DST: 381GB (95KB/event) • Skims: μ-pair 115,569 (5,711 written), Bhabha 475,680 (15,855), tight hadron 687,308 (45,819) • More than 2M BBbar events! [Diagram: data flow between the DTF2 tape library, RFARM and the new SAIT HSM] Nobu Katayama(KEK)

  36. Reprocessing strategy (DST production) • Goal: 3 months to reprocess all data using all KEK compute servers • Often we have to wait for constants; often we have to restart due to bad constants • Efficiency: 50~70% • History: • 2002: major software updates; 2002/7 reprocessed all data taken until then (78 fb-1 in three months) • 2003: no major software updates; 2003/7 reprocessed data taken since 2002/10 (~60 fb-1 in three months) • 2004: SVD2 was installed in summer 2003; software is all new and was being tested until mid-May; 2004/7 reprocessed data taken since 2003/10 (~130 fb-1 in three months) Nobu Katayama(KEK)

  37. [Chart: DST production elapsed time (actual hours); labels include database update, dE/dx setup, ToF, the Belle incoming data rate (1 fb-1/day) and RFARM]

  38. Comparison with previous system • Using 1.12THz total CPU we achieved >5 fb-1/day • See the Adachi/Ronga poster [Chart: luminosity processed per day (pb-1) vs. day; new system (-2004) vs. old system (-2003)]

  39. Generic MC production • Mainly used for physics background studies • 400GHz of Pentium III ~ 2.5 fb-1/day; 80~100GB/fb-1 of data in the compressed format • No intermediate hits (GEANT3 hits/raw) are kept; when a new release of the library comes, we try to produce a new generic MC sample • For every real data-taking run, we try to generate 3 times as many events as in the real run, taking run dependence into account: detector backgrounds are taken from random-trigger events of the run being simulated (a sketch of this bookkeeping follows) • At KEK, if we use all CPUs we can keep up with the raw data taking (×3) • We ask remote institutions to generate most of the MC events • We have generated more than 2×10^9 events so far using qq • 100TB for 3 × 300 fb-1 of real data; we would like to keep it on disk [Chart: millions of generic MC events produced since Apr. 2004] Nobu Katayama(KEK)
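
A hedged sketch of what "run-dependent" generation means in practice: for each real run, the job looks up how many events were recorded, plans three times as many MC events, and overlays backgrounds from that run's random-trigger sample. All names, numbers and file paths below are hypothetical.

```cpp
// Hypothetical sketch of run-dependent generic MC production planning.
// Function names, run numbers and file names are invented for illustration.
#include <cstdio>
#include <string>
#include <vector>

struct RunInfo { int exp, run; long recorded_events; };

// In reality this would come from the run summary / constants database.
std::vector<RunInfo> load_run_list() {
  return { {31, 500, 1200000}, {31, 501, 950000} };
}

int main()
{
  for (const RunInfo& r : load_run_list()) {
    long n_mc = 3 * r.recorded_events;            // 3x the real data statistics
    std::string bg = "random_trigger_e" + std::to_string(r.exp) +
                     "_r" + std::to_string(r.run) + ".dat";  // background overlay
    std::printf("exp %d run %d: generate %ld events, overlay %s\n",
                r.exp, r.run, n_mc, bg.c_str());
    // A real job would now run the event generator + GSIM with these inputs.
  }
  return 0;
}
```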

  40. Random number generator • Pseudo-random numbers can be difficult to manage when we must generate billions of events in many hundreds of thousands of jobs (each event consumes about one thousand random numbers in one millisecond) • Encoding the date/time as the seed may be one solution, but there is always a danger that one sequence is the same as another • We tested a random-number generator board based on thermal noise; the board can generate up to 8M true random numbers per second • We will run random number servers on several PC servers so that we can run many event-generator jobs in parallel; a sketch of a client follows Nobu Katayama(KEK)
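
A minimal sketch of how jobs might obtain seeds from such a server: each job reads a few truly random bytes over TCP and uses them to seed its local pseudo-random generator, falling back to mixing time and process id if the server is unreachable. The server address and protocol here are hypothetical.

```cpp
// Hypothetical client: fetch a truly random seed from a random-number
// server (address/port invented for illustration), then drive a local
// pseudo-random generator with it.  Falls back to mixing time and PID.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <random>

static std::uint64_t fetch_seed(const char* host, int port)
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port   = htons(port);
  inet_pton(AF_INET, host, &addr.sin_addr);

  std::uint64_t seed = 0;
  if (fd >= 0 && connect(fd, (sockaddr*)&addr, sizeof(addr)) == 0 &&
      read(fd, &seed, sizeof(seed)) == (ssize_t)sizeof(seed)) {
    close(fd);
    return seed;                            // 8 bytes of hardware randomness
  }
  if (fd >= 0) close(fd);
  // Fallback: mix time and process id so parallel jobs still differ.
  return (std::uint64_t)std::time(nullptr) * 2654435761u ^ (std::uint64_t)getpid();
}

int main()
{
  std::mt19937_64 rng(fetch_seed("192.168.1.10", 9000));  // hypothetical server
  std::printf("first random number: %llu\n", (unsigned long long)rng());
  return 0;
}
```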

  41. Network/data transfer

  42. Networks and data transfer • The KEKB computer system has a fast internal network for NFS • We have added Belle-bought PCs, now more than 500 PCs and file servers • We have connected dedicated 1Gbps Super-SINET lines to four universities • We have also requested that this network be connected to outside KEK for tests of Grid computing • Finally, a new Cisco 6509 has been added to separate the above three networks • A firewall and login servers make data transfer miserable (100Mbps max.) • DAT tapes are used to copy compressed hadron files and MC generated by outside institutions • Dedicated GbE networks to a few institutions are now being added; a total of 10Gbit to/from KEK is being added • Still slow networks to most collaborators Nobu Katayama(KEK)

  43. Data transfer to universities • We use Super-SINET, APAN and other international academic networks as the backbone of the experiment [Diagram: data flows from the Belle detector in Tsukuba Hall and the KEK Computing Research Center (NFS, 10Gbps within KEK) to Tohoku U, Nagoya U, U of Tokyo, TIT and Osaka U over Super-SINET, and to Australia, the U.S., Korea, Taiwan etc. over international links; indicated rates include 400 GB/day (~45 Mbps), ~1 TB/day (~100 Mbps) and 170 GB/day] Nobu Katayama(KEK)

  44. Grid plan for Belle • Belle hasn't made a commitment to Grid technology (we hope to learn more here) • Belle@KEK has not, but Belle@remote institutions may have been involved, in particular when they are also involved in one of the LHC experiments • Parallel/distributed computing aspect: nice, but we have our own solution; event (trivial) parallelism works. However, as we accumulate more data we hope to have more parallelism (several tens to hundreds to thousands of CPUs), so we may adopt a standard solution • Parallel/distributed file system: yes, we need something better than many UNIX files on many nodes; for each "run" we create ~1000 files and we have had O(10,000) runs so far • Collaboration-wide issues • WAN aspect: network connections are quite different from one institution to another • Authentication: we use DCE to log in; we are testing VPN with OTP; a CA-based system would be nice • Resource management: we want to submit jobs to remote institutions if CPU is available Nobu Katayama(KEK)

  45. Belle's attempts • We have separated the stream IO package so that we can connect to any of the (Grid) file management packages • We have started working with remote institutions (Australia, Tohoku, Taiwan, Korea) • See our presentation by G. Molony on the 30th in the Distributed Computing session • SRB (Storage Resource Broker, by San Diego Univ.) • We constructed test beds connecting Australian institutes and Tohoku University using GSI authentication • See our presentation by Y. Iida on the 29th in the Fabric session • gfarm (AIST) • We are using the gfarm cluster at AIST to generate generic MC events • See the presentation by O. Tatebe on the 27th in the Fabric session Nobu Katayama(KEK)

  46. Human resources • KEKB computer system + network • Supported by the computer center (1 researcher, 3~4 system engineers + 1 hardware engineer, 2~3 operators) • PC farms and tape handling • 1 KEK/Belle researcher, 2 Belle support staff members (they also help with production) • DST/MC production management • 2 KEK/Belle researchers, 1 postdoc or student at a time from collaborating institutions • Library/constants database • 2 KEK/Belle researchers + the sub-detector groups • We are barely surviving and are proud of what we have been able to do: make the data usable for analyses Nobu Katayama(KEK)

  47. Super KEKB, S-Belle and Belle computing upgrades

  48. More questions in flavor physics • The Standard Model of particle physics cannot really explain the origin of the universe: it does not have enough CP violation • Models beyond the SM have many CP-violating phases • There is no principle to explain the flavor structure in models beyond the Standard Model • Why Vcb >> Vub? • What about the lepton sector? Is there a relationship between the quark and neutrino mixing matrices? • We know that the Standard Model (incl. the Kobayashi-Maskawa mechanism) is the effective low-energy description of Nature; however, most likely New Physics lies in the O(1) TeV region • LHC will start within a few years and (hopefully) discover new elementary particles • Flavor-Changing Neutral Currents (FCNC) are suppressed; New Physics without a suppression mechanism is excluded up to 10^3 TeV → the New Physics flavor problem • Different mechanisms → different flavor structure in B decays (tau and charm as well) • Luminosity upgrade will be quite effective and essential Nobu Katayama(KEK)

  49. Roadmap of B physics • Tevatron (m~100GeV) → LHC (m~1TeV); KEKB (10^34) → SuperKEKB (10^35): a concurrent program • Identification of the SUSY breaking mechanism; anomalous CPV in b→sss if NP = SUSY • sin2φ1, CPV in B→ππ, φ3, Vub, Vcb, b→sγ, b→sll, new states etc. • Study of NP effects in B and τ decays • Precise test of the SM and search for NP (vs. time or integrated luminosity) • 2001: discovery of CPV in B decays (Yes!!) • Now: 280 fb-1 • NP discovered at LHC (2010?) Nobu Katayama(KEK)

  50. Projected luminosity [Chart: SuperKEKB projected integrated luminosity, reaching 50 ab-1 ("Oide scenario")] Nobu Katayama(KEK)
