FAST-OS: Petascale Single System Image
Jeffrey S. Vetter (PI)
Computer Science and Mathematics
Computing & Computational Sciences Directorate
Team Members: Nikhil Bhatia, Collin McCurdy, Hong Ong, Phil Roth, WeiKuan Yu
Petascale Single System Image
• What?
  • Make a collection of standard hardware look like one big machine, in as many ways as feasible
• Why?
  • Provide a single solution to address all forms of clustering
  • Simultaneously address availability, scalability, manageability, and usability
• Components
  • Virtual clusters using contemporary hypervisor technology
  • Performance and scalability of a parallel shared root file system
  • Paging behavior of applications, i.e., reducing TLB misses
  • Reducing operating system "noise" at scale
Virtual clusters using contemporary hypervisor technology
• Goal: a virtual cluster of domUs spanning multiple dom0s
  • Enable preemptive migration of OS images for reliability, maintenance, and load balancing
  • Provide virtual login node, virtual service node, and virtual compute node images (x86_64, FC5-based)
  • Initial experimentation on two x86_64 RHEL4 hosts
• Collect per-node/per-domain high-level information about host and guest domains
• Use a copy-on-write scheme at the NFS server
  • Unionfs-based root file system trees, one per virtual cluster node
• Investigate OProfile for active profiling
  • Performance data should help drive migration decisions
• Investigate xentop and the libvirt API for scalable management of domUs (see the sketch below)
  • Develop a management console for manual and algorithmic control
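The management console depends on per-domain data of the kind libvirt exposes. The Python sketch below shows one way such monitoring could look; the connection URI ("xen:///") and the exact fields reported are illustrative assumptions, not the project's console code.

```python
# Hypothetical sketch: poll guest domains on a Xen dom0 via libvirt and
# collect the high-level state/CPU/memory data a management console might use.
# The URI and reported fields are illustrative assumptions.
import libvirt

def poll_domains(uri="xen:///"):
    conn = libvirt.openReadOnly(uri)      # read-only access is enough for monitoring
    if conn is None:
        raise RuntimeError("could not connect to hypervisor at %s" % uri)
    stats = []
    for dom in conn.listAllDomains():     # running and defined domUs
        state, max_mem, mem, ncpu, cpu_time = dom.info()
        stats.append({
            "name": dom.name(),
            "state": state,               # e.g., libvirt.VIR_DOMAIN_RUNNING
            "memory_kb": mem,
            "max_memory_kb": max_mem,
            "vcpus": ncpu,
            "cpu_time_ns": cpu_time,
        })
    conn.close()
    return stats

if __name__ == "__main__":
    for s in poll_domains():
        print(s)
```

Data like this, sampled periodically across dom0s, is the raw input a migration policy (manual or algorithmic) would act on.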
High-speed migration of virtual clusters with InfiniBand (IB)
Goal: Reduce the impact of virtualization on high-performance computing application performance
• High bandwidth, low latency
• Effective migration of OS images
Phase 1: Xen virtualization with IB
• IBM and OSU have a prototype version functioning over IA32; it supports only IB verbs and IPoIB
• Need to port to x86_64
• Need support for other IB components
Phase 2: Xen migration over IB (prototype under development); see the sketch below
• Migration of IB-capable Xen domains over TCP
  • Snapshot and close IB state before migration
  • Reopen and reconnect after migration
• Address issues in handling user-space IB information
  • Protection keys
  • IB contexts, queue pair numbers, and LIDs
• Need to coordinate static domains as well
Phase 3: Live migration of IB-capable Xen domains over IB
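A minimal sketch of the Phase 2 flow (quiesce IB state, migrate over TCP, reconnect), expressed with libvirt's generic migration call. The quiesce_ib/reconnect_ib hooks and the destination URI are placeholders for the IB-specific steps described above; they are assumptions, not real APIs or the project's implementation.

```python
# Hypothetical sketch of the Phase 2 migration flow: quiesce InfiniBand state,
# migrate the domain over TCP with libvirt, then re-establish IB connections.
# quiesce_ib()/reconnect_ib() are placeholders for the IB-specific steps
# (saving protection keys, queue pair numbers, LIDs); they are not real APIs.
import libvirt

def migrate_domain(name, src_uri="xen:///", dst_uri="xen+ssh://destination-host/"):
    src = libvirt.open(src_uri)
    dst = libvirt.open(dst_uri)
    dom = src.lookupByName(name)

    quiesce_ib(dom)                        # placeholder: snapshot and close IB state

    flags = libvirt.VIR_MIGRATE_LIVE       # live migration; the transport here is TCP
    new_dom = dom.migrate(dst, flags, None, None, 0)

    reconnect_ib(new_dom)                  # placeholder: reopen and reconnect IB state

    dst.close()
    src.close()
    return new_dom

def quiesce_ib(dom):
    pass  # would tear down user-space IB contexts before the move

def reconnect_ib(dom):
    pass  # would rebuild queue pairs and re-resolve LIDs after the move
```

Phase 3 replaces the TCP transport with IB itself and removes the stop-and-restart of IB state in favor of live migration.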
Paging behavior of DOE applications
Goal: Understand the paging behavior of applications on large-scale systems
Trace-based memory subsystem simulator (sketch below)
• Performs dynamic tracing using Pin (Intel's binary instrumentation tool)
• For each data memory reference, the simulator
  • Executes code that looks up the data in caches/TLB/page table entries
  • Estimates the latency of misses
  • Produces an estimate of cycles spent in the memory subsystem
• Advantages of simulation
  • Allows evaluation of nonexistent configurations:
    • What if paging were turned off?
    • What if the Opteron used the Xeon's smaller-L1-cache-for-larger-TLB tradeoff?
Potential weakness: inaccuracy
• Sequential vs. parallel memory accesses
• Intel (simulator) vs. Portland Group (XT3) compiler
• Different runtime libraries (MPI)
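The per-reference simulation step can be illustrated without Pin itself. The Python sketch below models only the TLB part of that lookup; the entry count, page sizes, and miss latency are illustrative assumptions rather than measured Opteron parameters, and the real tool performs this work inside Pin-instrumented native code.

```python
# Hypothetical sketch of the per-reference step: look each data address up in
# a TLB model and charge an estimated miss latency.  Sizes and cycle costs are
# illustrative assumptions, not measured Opteron parameters.
from collections import OrderedDict

class TLB:
    """Fully associative, LRU-replaced TLB model."""
    def __init__(self, entries=64, page_size=4096, miss_cycles=30):
        self.entries = entries
        self.page_size = page_size
        self.miss_cycles = miss_cycles
        self.table = OrderedDict()        # page number -> True, kept in LRU order

    def access(self, addr):
        """Return the cycles charged to this reference (0 on a hit)."""
        page = addr // self.page_size
        if page in self.table:
            self.table.move_to_end(page)  # refresh LRU position
            return 0
        if len(self.table) >= self.entries:
            self.table.popitem(last=False)  # evict least recently used page
        self.table[page] = True
        return self.miss_cycles

def simulate(trace, page_size=4096):
    """trace: iterable of data addresses, e.g. replayed from a Pin-generated log."""
    tlb = TLB(page_size=page_size)
    return sum(tlb.access(a) for a in trace)

# Running the same trace with page_size=4096 and page_size=2 * 1024 * 1024
# mimics the small-page vs. large-page variants compared in this study.
```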
Applications and methodology
Applications
• High-Performance Computing and Communications benchmarks
  • Sequential: STREAM, DGEMM, FFT, RandomAccess
  • Parallel: MPI_RandomAccess, HPL, FFT, PTRANS
• Climate: POP, CAM, HYCOM
• Fusion: GYRO, GTC
• Molecular dynamics: AMBER, LAMMPS
Methodology
• 16p version of each benchmark
• Unmeasured: initialization code and cache warm-up; measured: three time steps
• Eight variants, with results normalized to HW-large (see the sketch below):
  • Hardware (HW): large and small pages
  • Simulated Opteron: large, small, and no pages
  • Simulated Opteron/Xeon hybrid: large, small, and no pages
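For illustration, the sketch below shows what normalizing the variants to the HW-large baseline amounts to; the function is an assumption about the bookkeeping, not the project's analysis script.

```python
# Hypothetical sketch: normalize per-variant cycle counts to the HW-large
# baseline before comparison.  Variant names follow the list above.
def normalize_to_hw_large(results):
    """results: dict mapping variant name -> measured or simulated cycles."""
    baseline = results["HW-large"]
    return {variant: cycles / baseline for variant, cycles in results.items()}

# Usage: normalize_to_hw_large({"HW-large": ..., "HW-small": ..., "Sim-Opteron-large": ...})
```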
Example results: HYCOM
Initial simulation results match measurements and illustrate the differences among page configurations.
Parallel root file system
Goals of study
• Use a parallel file system to implement a shared root environment
• Evaluate the performance of the parallel root file system
• Understand root I/O access patterns and potential scaling limits
Current status
• RootFS implemented using NFSv4, PVFS2, and Lustre 1.4.6
• RootFS distributed using a ramdisk implementation
  • Modified the mkinitrd program locally (PVFS2 and Lustre)
  • Modified init scripts to mount the root at boot time (PVFS2 and Lustre)
  • Configured compute nodes to run a small Linux kernel
• Released two Lustre 1.4.6.4 patches for kernel 2.6.15
• File system performance measured using IOR, b_eff_io, and NPB I/O (see the sketch below)
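For illustration only, the sketch below times a large sequential write against a mounted file system, which is the flavor of measurement IOR performs; it is not IOR, b_eff_io, or NPB I/O, and the mount point, block size, and block count are assumptions.

```python
# Hypothetical sketch: time a large sequential write to a mounted parallel
# file system to get a rough bandwidth figure.  Not IOR/b_eff_io/NPB I/O;
# the path and transfer sizes are illustrative assumptions.
import os
import time

def sequential_write_bandwidth(path, block_size=4 * 1024 * 1024, blocks=256):
    """Write `blocks` buffers of `block_size` bytes to `path` and return MiB/s."""
    buf = os.urandom(block_size)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())              # ensure the data reaches the file system
    elapsed = time.time() - start
    total_mib = block_size * blocks / (1024 * 1024)
    return total_mib / elapsed

if __name__ == "__main__":
    # e.g., a file under the PVFS2- or Lustre-backed root (path is an assumption)
    print("%.1f MiB/s" % sequential_write_bandwidth("/mnt/rootfs/testfile"))
```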
Contacts
Jeffrey S. Vetter, Principal Investigator
Computer Science and Mathematics
Computing & Computational Sciences Directorate
(865) 356-1649, vetter@ornl.gov
Hong Ong, (865) 241-1115, hongong@ornl.gov
Nikhil Bhatia, (865) 241-1535, bhatia@ornl.gov
Collin McCurdy, (865) 241-6433, cmccurdy@ornl.gov
Phil Roth, (865) 241-1543, rothpc@ornl.gov
WeiKuan Yu, (865) 574-7990, wyu@ornl.gov
Ong_SSI_0611