Fabric Management at CERN · BT · July 16th 2002 · Tony.Cass@CERN.ch


The Problem
• ~6,000 PCs
• Only 1/3rd of the total capacity is at CERN… Grid Computing.
• Another ~1,000 boxes
• c.f. ~1,500 PCs and ~150 disk servers at CERN today.

The Past
• Automated management tools developed to handle multi-architecture clusters with a few tens of nodes.
• Good points
  • Much automation
  • Solid set of tools
  • Much accumulated experience
• Bad points
  • Can’t cope with the number of systems we have today
  • Configuration information stored in multiple locations
  • Monitoring at system level, but users see service failures.

Where we are going
• Use Linux standards
  • RPM, LSB, …
• Single location (/interface) for configuration information
  • Which nodes in which clusters
  • Node roles, states, required software
  • Personnel roles (who is allowed to perform what)
• Better installation tools
  • Guaranteed reproducibility across nodes and over time
  • Making use of configuration information
  • Multiple distinct system “images”
• Service level monitoring
  • Making use of configuration information
• State management for
  • System reconfiguration requests
    • Both system upgrades and reconfigurations to reflect workload changes
  • Automatic recovery procedures (and non-automatic if necessary…)
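
To make the "single location for configuration information" idea concrete, here is a minimal Python sketch. It is not the tooling used at CERN: the class names (`Node`, `ConfigStore`), the state values, the `min_fraction` threshold and the availability rule are all assumptions chosen for illustration. It shows one store answering "which nodes in which clusters" and service level monitoring reusing the same information, so that a node in maintenance does not count as a service failure.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set


class NodeState(Enum):
    # Hypothetical states; the slides only say that nodes have "states".
    PRODUCTION = "production"
    MAINTENANCE = "maintenance"


@dataclass
class Node:
    name: str
    cluster: str
    role: str
    state: NodeState
    # Required software, e.g. RPM name -> version (illustrative only).
    packages: Dict[str, str] = field(default_factory=dict)


@dataclass
class ConfigStore:
    """Single place holding all node configuration (assumed design)."""
    nodes: Dict[str, Node] = field(default_factory=dict)

    def add(self, node: Node) -> None:
        self.nodes[node.name] = node

    def cluster_members(self, cluster: str) -> List[str]:
        """Which nodes are in which cluster."""
        return sorted(n.name for n in self.nodes.values() if n.cluster == cluster)

    def service_available(self, cluster: str, up_nodes: Set[str],
                          min_fraction: float = 0.5) -> bool:
        """Service level check: enough production nodes of the cluster are up.

        Nodes in maintenance are ignored, so planned work is not reported
        as a service failure. The threshold is an assumed policy.
        """
        expected = [n for n in self.nodes.values()
                    if n.cluster == cluster and n.state is NodeState.PRODUCTION]
        if not expected:
            return False
        alive = sum(1 for n in expected if n.name in up_nodes)
        return alive / len(expected) >= min_fraction
```

For example, with two production nodes and one maintenance node in a hypothetical "batch" cluster, `store.service_available("batch", {"n1"})` reports the service as available when only one production node responds, while the same check with `min_fraction=0.75` reports a failure.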
