
Database and Monitoring




  1. Database and Monitoring
  1) Database
  It's no secret that there were a lot of problems that we hope are behind us now. We set things up to avoid downtime from network problems, which in the end never happened: the network was never down, not even for 5 minutes. The setup with autonomous partitions, however, was plagued by problems with our particular network configuration (dual routes between machines); this needs Objy to do something about it eventually. Right now we are running the sole Phenix lock server in the counting house, so RCF would be the side hit by a network outage; but again, the network to here is much more stable than RCF itself, and the server is where it should be.

  2. Putting run stuff into the DB retroactively
  I'm going to put the contents of the "ascii run info database" into the database retroactively, merging the info with the existing online logbook. I did that once already in the middle of the run, for what had been written up to then, but then the Objy problems hit. I'll do it as soon as I find a morning or so; all the software is in place, and the wait gives me time for a final "sit back and think it over". Next time we will update the DB right away from the begin-run and end-run scripts. The tools above already work that way, reading back the info available to the scripts (which so far was just written out to an ascii file; a minimal parsing sketch follows below). We will presumably keep writing an ascii file in addition, just in case.
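  To make the mechanics concrete, here is a minimal sketch, not the actual begin/end-run scripts or retro-fill tools: it reads back a hypothetical "key value" ascii run info file into memory, ready to be merged into the run database and the logbook. The file format, the comment convention, and the runinfo.txt name are assumptions for illustration only.

    // Minimal sketch (hypothetical format, not the real tools): read an
    // ascii run info file of "key value" lines into a map so the values
    // can then be written into the run database / online logbook entry.
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    int main(int argc, char** argv) {
        if (argc < 2) {
            std::cerr << "usage: " << argv[0] << " runinfo.txt" << std::endl;
            return 1;
        }
        std::ifstream in(argv[1]);
        std::map<std::string, std::string> runinfo;
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream words(line);
            std::string key;
            if (!(words >> key) || key[0] == '#') continue;  // skip blanks and comments
            std::string value;
            std::getline(words >> std::ws, value);           // rest of the line
            runinfo[key] = value;
        }
        // The real tool would push these values into the database entry for
        // the run and the matching logbook record; here we just print them.
        for (const auto& kv : runinfo)
            std::cout << kv.first << " = " << kv.second << std::endl;
        return 0;
    }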

  3. New logbook snapshot
  • Improved user interface
  • Make and view comments to ANY entry
  • Automatically inserted Run entries...

  4. Note entry form
  • Support for predefined keywords
  • Support for "exact" formatting

  5. Also...
  • "Carriage return formatting" (which was preserved but not dealt with before) is restored for old entries
  • Better graphics support: gif, jpg, png
  • Who wants audio? PS? PDF? It's doable...
  • Day-at-a-glance view
  • Run information: trigger setup name, events total and broken down by trigger, magnet on or off, RHIC intensities when available (late in the run)
  • In the future: the trigger setup info file lists more ancillary stuff, HV, etc., where possible

  6. Monitoring
  The two packages (o- and pMonitor) are basically OK but were bogged down by DD problems most of the time; we took stop-gap measures to fix things on our feet (rddEventiterator). ChrisP will look into improving DD or replacing it (a potential candidate package from Jlab), and probably also into a genuine client-server model. DD's shared-memory paradigm isn't the best for our distributed computing environment; it is better suited for SMPs (phoncs0), as the sketch below illustrates. On the other hand, we may just get many dual-CPU Linux boxes, so we don't discount shared memory yet. I will finally publish the o/pMonitor manual... and I will probably need to provide more hand-holding for the subsystems to get the monitoring right.
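  On the shared-memory point, here is a minimal sketch (not DD code; the segment name and event-slot layout are invented) of the POSIX shared-memory pattern: it connects only processes on the same host, which is why it suits an SMP like phoncs0 but needs a network (client-server) hop once monitoring clients run on separate Linux boxes.

    // Sketch only: a named POSIX shared-memory "event slot" as a producer on
    // one host would create it.  A monitoring client on the SAME machine can
    // shm_open/mmap the same name; a client on another box cannot, which is
    // the limitation of the shared-memory paradigm in a distributed farm.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    struct EventSlot {                 // invented, fixed-size slot layout
        unsigned int length;           // payload length in bytes
        char         payload[64 * 1024];
    };

    int main() {
        int fd = shm_open("/mon_events", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { std::perror("shm_open"); return 1; }
        if (ftruncate(fd, sizeof(EventSlot)) < 0) { std::perror("ftruncate"); return 1; }

        void* mem = mmap(0, sizeof(EventSlot), PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) { std::perror("mmap"); return 1; }
        EventSlot* slot = static_cast<EventSlot*>(mem);

        // Fill the slot with a dummy "event"; a consumer on this host would read it.
        std::strcpy(slot->payload, "dummy event");
        slot->length = std::strlen(slot->payload);

        munmap(mem, sizeof(EventSlot));
        close(fd);
        return 0;
    }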

  7. Data logging
  Could have been better. We don't understand why we appear to top off at 8 MB/s; that puts a cap on how many events/s we can take. We wouldn't have gotten a factor of 2 more, but some more. The hardware should (and did in the beginning) sustain > 14 or so MB/s easily but didn't in the end. On the other hand, we are transferring the data in the almost optimally bad way...
  Current data path: Event Builder (little-endian) → PHONCS0 (big-endian) → HPSS → RCF (little-endian)

  8. That's the proposal...
  Move the logging off the big-endian Sun and onto a Linux box; keep the same endianness all the way through and get rid of the "swap fee" (see the sketch below).
  Proposed data path: Event Builder (little-endian) → Linux box (little-endian) → HPSS → RCF (little-endian)
  Also, get dedicated machines that do the logging and not much else, unlike Phoncs0 as it is now.
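  As an illustration of the "swap fee" (a sketch, not Event Builder or logger code): when little-endian event words have to be handled on a big-endian host, every 32-bit word gets byte-swapped on each hop, and keeping the whole chain little-endian makes this per-word loop disappear.

    // Sketch of the per-word cost an all-little-endian chain avoids.
    #include <cstdint>
    #include <cstddef>
    #include <cstdio>

    inline uint32_t swap32(uint32_t w) {                 // reverse the 4 bytes of one word
        return (w >> 24) | ((w >> 8) & 0x0000ff00u) |
               ((w << 8) & 0x00ff0000u) | (w << 24);
    }

    void swapBuffer(uint32_t* buf, std::size_t nwords) { // swap a whole event buffer in place
        for (std::size_t i = 0; i < nwords; ++i)
            buf[i] = swap32(buf[i]);
    }

    int main() {
        uint32_t demo[2] = { 0x12345678u, 0xcafebabeu }; // stand-in event words
        swapBuffer(demo, 2);
        std::printf("%08x %08x\n", demo[0], demo[1]);    // prints 78563412 bebafeca
        return 0;
    }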

  9. A lucky break...
  As it so happens, ITD/RCF put two HPSS fibers in for us, one of them unused at this point. I requested and got the 2nd fiber and the ports on the RCF side. Next time we can run a dual buffer:
  Event Builder → Buffer box 1 → HPSS fiber 1
  Event Builder → Buffer box 2 → HPSS fiber 2

  10. What would we need?
  • Two relatively beefy, but essentially headless, Linux boxes
  • 4x SCSI (Adaptec 94160)
  • 3x Alteon Gigabit (1 we have in PCI already)
  • Another 500 GB disk array
  We have about 30K earmarked at RCF for this "buffer box", to be spent at our discretion (this was done already for Star and Brahms). That would put the money into disks, not an expensive machine... if it works. Remember, there's no need to spend $$$ on 120 MB/s if RCF is taking only 20. 40 MB/s gives a nice comfortable headroom.

  11. Concerns
  1) Logging would no longer be done on the "run control" machine
  • Makes CORBA communication necessary between ORBIX and a third-party ORB on Linux.
  • Ed is looking into that. IIOP should do the trick, but we have to show that it works. Other people do it, so I'm not too concerned.
  • All of the rest (ndd_event_server etc.) already runs on Linux.
  2) Can Linux do it? Is the file system good enough?
  • Commercial companies do it successfully, but still...
  • A crash and the subsequent fsck would take - what? Hours? 30 minutes?
  • Will Linux have a journaling file system any time soon?
  • But then, we didn't have a journaling FS on Sun either, well...
  • Maybe we can keep 100 Gig in reserve to continue running while we fsck?

  12. But we can test...
  Without spending much money, we can do a 50% test: get the bigdisk and the Alteon gigabit card off phoncs0, take phoncsb or c, and get two SCSI adapters.
  • Make software RAID systems on the two halves of bigdisk and see if that works.
  • Run the event builder and see the throughput to disk (a rough throughput sketch follows below).
  • Then hook the gigabit up to HPSS and see that I/O rate.
  • Play fsck games while we have 2 x 250 Gig.
  • Make one 500 Gig striped FS and fsck it.
  • Repeat the throughput test with the 500 Gig filesystem.
  I'd propose to run that test after Sep 19, when everything goes down anyway. If it works, we delay the disk purchase until January/February to get more disk for the $; ditto for the machines - let's see what you can get then at commodity prices.
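  A rough sketch of the disk-throughput part of that test, assuming a plain "write big blocks and time it" measurement; the block size, block count, and output path are arbitrary. Note that writes land in the Linux page cache first, so the file needs to be much larger than RAM (or be synced) before the MB/s number says anything about the disks.

    // Sketch of a sequential-write throughput check on the software-RAID volume.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <iostream>
    #include <vector>

    int main(int argc, char** argv) {
        const std::size_t blockSize = 1 << 20;                  // 1 MB per write
        const std::size_t nBlocks   = 1024;                     // 1 GB total; pick >> RAM for a real run
        const char* path = (argc > 1) ? argv[1] : "throughput.test";

        std::vector<char> block(blockSize, 'x');                // dummy payload
        std::FILE* f = std::fopen(path, "wb");
        if (!f) { std::perror("fopen"); return 1; }

        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < nBlocks; ++i) {
            if (std::fwrite(block.data(), 1, blockSize, f) != blockSize) {
                std::perror("fwrite");
                return 1;
            }
        }
        std::fclose(f);                                         // flushes stdio buffers (not the page cache)
        auto t1 = std::chrono::steady_clock::now();

        const double seconds   = std::chrono::duration<double>(t1 - t0).count();
        const double megabytes = double(blockSize) * double(nBlocks) / (1024.0 * 1024.0);
        std::cout << megabytes / seconds << " MB/s to " << path << std::endl;
        return 0;
    }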

  13. If that does not work...
  ...well, then it's back to a higher-end... what? SGI? Sun 450? We'd get the endianness problem back, and we'd spend a lot more money on the computer rather than on disk. We'll see. I'm optimistic.
