
Presentation Transcript


1. BRIITE Meeting, Sept 5-7, 2012 – What's the big deal!?
Deploying and Governing a Highly Scalable Scientific Computing Environment
Dirk Petersen, Scientific Computing Manager, Fred Hutchinson, Seattle

2. FHCRC Big IT Projects 2010-2013 … with impact on Scientific Computing (30-50 PIs, ~300 users)

3. Storage – how much space do we need?
• What's our growth … is it exponential?
• Primary storage usage is >550 TB today and was 128 TB in 2008.
• Will we continue to grow at 45% per year? New genomic compression tools (CRAM, Quip) may help.
• Chris Dag: It's not just more raw sequence data; downstream analyses also need more storage (and less predictably).
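A quick back-of-the-envelope check on the growth rate claimed above (a sketch only; the 128 TB and 550 TB figures come from the slide):

    # Compound annual growth rate from 128 TB (2008) to >550 TB (2012), i.e. over ~4 years
    start_tb, end_tb, years = 128.0, 550.0, 4
    cagr = (end_tb / start_tb) ** (1.0 / years) - 1
    print("annual growth: %.0f%%" % (cagr * 100))   # ~44%, consistent with the ~45% figure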

4. Storage – what will it cost us at 40% growth?
• Forrester, Gartner & friends: enterprise storage costs decrease 18% per year.
• New reference: AWS Glacier at $120/TB/year, which may decrease 3%/year (like S3).
• After 10 years: NAS ~$8m, Glacier ~$7m.
• Simplified: ignores datacenter, staffing, backup & redundancy, power, cooling, and transfer costs.

5. Storage – you don't believe those 18%?
• After all, Forrester told us that S3 is really cheap …
• OK, let's say enterprise storage costs decrease 10% per year.
• After 10 years: NAS ~$13m, Glacier ~$7m … but a big scary graph in year 11.
• The sky is not falling, but cost, not technology, limits scalability.
• Costs for consumer-grade components (e.g. Drobo) fall rapidly; maybe a solution?
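The two cost scenarios on slides 4 and 5 boil down to compounding capacity growth against declining unit prices. Below is a minimal sketch of that model: the 40% growth rate, the Glacier price of $120/TB/year with a ~3%/year decline, and the 18% vs. 10% NAS price declines are from the slides, but the starting NAS price per TB is an assumed placeholder, so the absolute totals are illustrative and will not reproduce the ~$7m/$8m/$13m figures, which depend on inputs not shown here.

    def ten_year_cost(start_tb, dollars_per_tb_year, growth, price_decline, years=10):
        """Total spend over `years`: capacity compounds up, unit price compounds down."""
        total, tb, price = 0.0, start_tb, dollars_per_tb_year
        for _ in range(years):
            total += tb * price
            tb *= 1 + growth
            price *= 1 - price_decline
        return total

    START_TB, GROWTH = 550, 0.40          # today's footprint (slide 3), assumed growth (slide 4)
    NAS_PRICE, GLACIER_PRICE = 250, 120   # $/TB/year; the NAS figure is a placeholder

    print("NAS, 18%%/y decline: $%.1fM" % (ten_year_cost(START_TB, NAS_PRICE, GROWTH, 0.18) / 1e6))
    print("NAS, 10%%/y decline: $%.1fM" % (ten_year_cost(START_TB, NAS_PRICE, GROWTH, 0.10) / 1e6))
    print("Glacier, 3%%/y decline: $%.1fM" % (ten_year_cost(START_TB, GLACIER_PRICE, GROWTH, 0.03) / 1e6))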

6. Storage – how fast does it have to be?
• 50 nodes (20-25% of the cluster) need to push data at 1 Gbit/s each, roughly 5 GB/s aggregate (5 x 10G Ethernet).
• More users mean more random I/O access patterns.
• We cannot scale performance using only nearline storage.
• Multi-tier scale-out NAS or NAS virtualization.
• Chris Dag: Customers will no longer build one big scale-out NAS tier.
• Chris Dag: My 'hack' of using nearline-spec storage as the primary science tier is obsolete in 2012.
• Chris Dag: Science changes faster than IT infrastructure.
• Chris Dag: Scale-out NAS has won the battle (others: so we are stuck with NFS forever?).
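Checking the aggregate-bandwidth bullet above (a sketch; the ~100 MB/s effective throughput per 1 GbE client link and the ~80% usable share of a 10 GbE uplink are assumptions, not slide numbers):

    import math

    nodes = 50
    per_node_mb_s = 100                                    # effective 1 GbE throughput per node
    aggregate_gbit_s = nodes * per_node_mb_s * 8 / 1000.0  # ~40 Gbit/s, i.e. ~5 GB/s
    uplinks = int(math.ceil(aggregate_gbit_s / (10 * 0.8)))
    print("~%.0f Gbit/s aggregate, about %d x 10 GbE uplinks" % (aggregate_gbit_s, uplinks))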

7. Storage – and when do you need your data back?
• Our brand-new TSM system restores 4-6 TB/day; restoring 1 PB from tape in case of disaster could easily take 6+ months.
• Mirroring data is still expensive … just buy a second enterprise NAS?
• Most data protection systems are not intelligent (they cannot restore the newest data first and leave the old stuff).
• We really need to archive some data … but how many low-hanging fruits (folders with large, old data) do we really find?
• What do others think?
• Is mirroring of online big data to another building the next step?
• … or are we going to mirror big data off campus?
• … or are we going to keep our archives in a different building or off campus (e.g. Glacier) and expect our users to archive frequently?
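The restore-time estimate in the first bullet follows directly from the observed TSM throughput (slide numbers, simple arithmetic):

    # 1 PB restored at 4-6 TB/day
    pb_in_tb = 1000.0
    for rate_tb_day in (4, 6):
        days = pb_in_tb / rate_tb_day
        print("%d TB/day -> %.0f days (~%.1f months)" % (rate_tb_day, days, days / 30))
    # roughly 170-250 days, i.e. the "6+ months" on the slide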

8. Option 1: Scale-out NAS with multiple performance tiers
[Flattened architecture diagram; recoverable content below]
• Use cases: researcher share (NTFS), big data shares (POSIX), short-term scratch (30d), long-term scratch (+30d), re-creatable data (downloads, TCGA, etc.), archive (Starfish), shared data, external collaboration, off-site protection.
• Storage platforms: NetApp for files that need NTFS ACLs (with Varonis auditing); enterprise scale-out NAS with fast (15K SAS) and slow (7.2K SATA) disks for primary and moderate-performance files; non-HA NAS for scratch; cloud NAS gateway, e.g. with access to Glacier, for archive.
• Data protection per tier: mirrored data using rsync or a 2nd NAS plus CommVault backup (RPO: 0 hours, RTO: 0 hours); TSM incremental (RPO: 24 hours, RTO: 24 hours); non-HA NAS with TSM incremental; slow backup/restore, "cold" data protection for archive (RPO: 24 hours, RTO: 72 hours); no backup for scratch.
• Planning assumption: 30% annual growth.

9. Option 2: NAS virtualization / acceleration with scale-out NAS & low-cost boxes
[Flattened architecture diagram; recoverable content below]
• Use cases (same as Option 1): researcher share (NTFS), big data shares (POSIX), short-term scratch (30d), long-term scratch (+30d), re-creatable data (downloads, TCGA, etc.), archive (Starfish), shared data, external collaboration, off-site protection.
• NAS virtualization / acceleration layer: 30-day cache on fast disk, transparent data migration/replication, global namespace, scales performance independently from capacity (see the placement sketch after this slide).
• Storage platforms: NetApp for files that need NTFS ACLs (with Varonis auditing); enterprise scale-out nearline NAS (7.2K SATA drives only); non-HA low-cost boxes (250 TB storage pods, aka "thumpers") at 50% of Glacier cost; non-HA NAS; cloud NAS gateway, e.g. with access to Glacier, for archive.
• Annual growth assumptions per tier: 50%, 50%, 0%, 30%.
• Data protection per tier: mirrored data using NAS virtualization plus CommVault backup (RPO: 0 hours, RTO: 0 hours); TSM incremental (RPO: 24 hours, RTO: 24 hours); slow backup/restore, "cold" data protection for archive (RPO: 24 hours, RTO: 72 hours); no backup for scratch.
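The 30-day cache in Option 2 is essentially an age-based placement policy: recently used files stay on the fast virtualization layer and everything else migrates down. A minimal sketch of such a policy follows; the 30-day window mirrors the slide, while the tier names and the one-year threshold are illustrative assumptions:

    import os, time

    DAY = 86400

    def target_tier(path, now=None):
        """Pick a storage tier for a file based on how recently it was accessed."""
        now = now or time.time()
        age_days = (now - os.stat(path).st_atime) / DAY
        if age_days <= 30:         # 30-day cache window (slide)
            return "fast-cache"    # NAS virtualization / acceleration layer
        elif age_days <= 365:      # illustrative cut-off, not from the slide
            return "nearline-nas"  # enterprise scale-out nearline NAS
        else:
            return "low-cost-pod"  # non-HA storage pods / archive gateway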

10. Storage – any governance?
• Do you ask users to formally request storage expansions?
• Do you track reasons for storage requests?
• Do you have a chargeback model?
• What's your threshold for 'free' vs 'pay' storage?
• Do you charge below cost to steer utilization, or do you charge full costs?
• Do you charge for different tiers?
• Is there a free archive?

11. FHCRC Big IT Projects 2010-2013 … with impact on Scientific Computing (30-50 PIs, ~300 users)

12. FHCRC HPC approach then and now
[Flattened diagram comparing the 2010 and 2012 environments; locations shown: Hutch Campus (G20 Data Center) and offsite Tukwila; storage labels: Archive (Old Fred), Large Scientific Data (New Fred), Office & Grant Data (Tungsten), Horton Enterprise HPC, Fred, PHS NetApp, SAS HPC Silo, Tungsten; login nodes in front of each cluster]
FROM (2010):
• 3-4 separate HPC environments, 5-6 storage systems.
• Hyrax (2,700 cores): Slurm, openSUSE 11.3, Ganglia, AD/OpenLDAP.
• Mercury (830 cores): Moab/Torque, CentOS 5.5, XEN, AD/Centrify.
• Orca, Case, Rhino (160 cores).
• Compute hardware: Fujitsu, Dell, IBM.
• Storage: 3PAR, NetApp, Solaris, Windows, Gluster.
TO (2012):
• Single HPC cluster: 100 public nodes, the rest private; 350 compute nodes, 3,300 cores, 13 TB memory, 9,000 ECU (equivalent to $300-$400/h on EC2).
• 4 login nodes / big-memory machines for *all* Hutch users.
• Stack: Moab/Slurm, Ubuntu 12.04, Puppet, Ganglia, AD/OpenLDAP, Orchestra, XEN for Windows/legacy.
• Test/dev cluster with prepackaged test cases.

13. FHCRC HPC approach 2012-2013 … priorities and challenges

14. HPC stability – what does it mean to be a commodity? Challenges & improvements
Improvements:
• DevOps config management tool (Puppet) to document and rapidly upgrade large numbers of systems with (slightly) varying configurations.
• Switched to Ubuntu LTS with 5-year support (commercial support optional) and "backports"; the best-tested dev environment for computational biologists.
• Implemented a change management process that balances the needs of pipeline users and ad-hoc users … it's a process (virtualization?).
• Invested 100+ developer hours to fully integrate PeopleSoft / automated account scripts with Unix / HPC (creating AD test environments – no fun).
• Hope: kill and resume … checkpointing of entire Unix processes without user preparation; frameworks are DMTCP (user space) and BLCR (kernel space). Storage is a challenge! (a minimal sketch of the idea follows after this slide)
Challenges:
• Consistency: using traditional OS imaging to reproducibly deploy an OS to many slightly different machines is slow and error-prone.
• OS stability/familiarity: intermittent kernel problems with an unmaintained Linux (openSUSE).
• Consistency: change management does not mean the same to all customers; some want changes ASAP, others detailed preparation.
• Consistency: different account management systems (inconsistent uid/gid) in multiple HPC environments cause work interruptions for users (sooner or later).
• Resiliency: downtime does not matter per se, but some jobs run for 12 days; small problems can stop them and make users unhappy; HA is often not a solution; checkpointing is unknown / unpopular.
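The hoped-for kill-and-resume workflow might look roughly like this with DMTCP's command-line tools (a sketch under assumptions: dmtcp_launch, called dmtcp_checkpoint in older 1.x releases, and the generated dmtcp_restart_script.sh are standard DMTCP entry points, but the job script name and the hourly checkpoint interval are made up for illustration):

    import subprocess

    # Launch the job under DMTCP and write a checkpoint every hour (3600 s).
    subprocess.check_call(["dmtcp_launch", "--interval", "3600", "./long_running_job.sh"])

    # After a node failure the job can be resumed from the last checkpoint via the
    # restart script DMTCP writes alongside the checkpoint images:
    # subprocess.check_call(["./dmtcp_restart_script.sh"])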

15. DevOps / config mgmt: Puppet / Chef are great!
• Configuration and documentation become one; use SCM (svn, git) to store your repositories and see who made what changes, when and where.
• It takes a long time to get going; system administrators need to work differently.
• Even manage software that does not like to be managed. For example: how do I make sure that we use one default version of Java everywhere?

    exec { "update-java-alternatives --set java-6-sun":
      unless  => 'test $(readlink /etc/alternatives/java) == "/usr/lib/java6/jre/bin/java"',
      require => Package["sun-java6-jre"],
    }

16. HPC performance – more than just horsepower? Challenges & improvements
Improvements:
• Significant storage performance improvements with Ubuntu 12.04 (I/O-less dirty throttling in kernel 3.2).
• Ubuntu is one of the few modern (kernel 3.x) Linux OSes with a commercial support option (extra benefits: 40k packages, Bio-Linux, convergence?).
• Todo: cgroups/cpusets to run only on allocated cores (see the sketch after this slide).
• Scratch spaces: node-local, global (NFS), monthly scratch, SSD; documentation and how-tos.
• Test/dev cluster (previous-generation hardware) on the production network / storage continuously runs and measures the most important user test cases.
• Kill & re-queue or suspend queues: a special queue for advanced users who can handle interrupted jobs, with many cycles on private nodes.
Challenges:
• Performance bottlenecks caused by large sequential writes with Linux and NFS-mounted storage.
• Offer a modern Linux OS with current tools and libraries (not CentOS 5).
• Users step on each other's toes when running on the same node (e.g. unintended multithreading).
• Certain workflows (e.g. RNA-seq) are inefficient on NFS-based storage.
• Unpredictable performance leads to user frustration; user apps occasionally run 2-4 times slower than normal (hardware, sharing, etc.).
• Accommodate very large performance requirements of a small community.
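A minimal sketch of the cgroups/cpuset to-do item above, assuming a cgroup-v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset and root privileges; the job id and core list are placeholders that would normally come from the scheduler's allocation:

    import os

    CPUSET_ROOT = "/sys/fs/cgroup/cpuset"

    def confine_job(job_id, cores, mem_nodes="0"):
        """Create a per-job cpuset and move the current process (and its children) into it."""
        path = os.path.join(CPUSET_ROOT, "job_" + job_id)
        os.makedirs(path)
        with open(os.path.join(path, "cpuset.cpus"), "w") as f:
            f.write(cores)               # e.g. "0-3": the cores allocated to this job
        with open(os.path.join(path, "cpuset.mems"), "w") as f:
            f.write(mem_nodes)           # NUMA memory nodes the job may use
        with open(os.path.join(path, "tasks"), "w") as f:
            f.write(str(os.getpid()))    # confine this process; children inherit the cpuset

    confine_job("12345", "0-3")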

17. HPC governance – let's get the priorities right. Challenges & improvements
Improvements:
• Formed 2 oversight committees, one for faculty and one for scientific developers / bioinformaticians … still varying levels of interest.
• Use scheduler accounts: PI = account, users submit jobs to accounts, externals get no default (sketched after this slide).
• Getting developers into one room really allows you to prioritize and reduce scope if time is the priority.
• Implement life-cycle management: 3 years for cluster nodes and 2 years for login / large-memory nodes (a bigger challenge for private nodes).
• Develop a roadmap: buy more, augment with cloud (EC2), or throttle some (mostly big) users.
Challenges:
• Governance helped secure an S10 grant in 2009. As appropriate governance becomes more common, how can you set yourself apart in the future?
• Different communities: faculty are interested in resource allocation, costs, and strategies; scientific developers in choices; users in automation.
• Prioritize PIs, not just users; make sure external users are prioritized correctly (fine tuning).
• HPC project timeline: increased scope (different OS) conflicts with the HPC project timeline.
• Multiple generations of hardware, plus a mix of private (PI-funded) and public nodes, can lead to unpredictable performance and reliability.
• Optimize and plan resource allocation, as all resources are fully allocated 3-4 times per month.
• Secure future HPC infrastructure grants.
• Integrate multiple Galaxy servers with HPC.
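One way the 'PI = account' model could be expressed in the scheduler's accounting database; a sketch assuming Slurm accounting from the deck's Moab/Slurm stack (sacctmgr and sbatch -A are standard Slurm commands, while the account and user names here are hypothetical):

    import subprocess

    def run(cmd):
        # Echo and execute one command.
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)

    # One scheduler account per PI; lab members are attached to it and charge their
    # jobs against it ("-i" skips sacctmgr's confirmation prompt).
    run(["sacctmgr", "-i", "add", "account", "smith_lab", "Description=smith-lab"])
    run(["sacctmgr", "-i", "add", "user", "adoe", "Account=smith_lab"])
    # Users then submit against the PI account, e.g.:  sbatch -A smith_lab job.sh
    # External users get no default account, so they must always pass -A explicitly.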

18. Appendix: SharePoint Enterprise wiki
• Moving away from Unix-based wikis.
• Better collaboration with administrative staff, technical writers, etc.
• Communication and improved documentation become much more important to SciComp users.

19. FHCRC IT code on GitHub: https://github.com/FredHutch/IT – some highlights in our public code repository

20. Appendix: Utilization on priority (public) and backfill (public & private) queues
