This report provides an overview of the genome sequencing center's data flow and storage system, including information on instrument-specific raw data, DNA sample tracking, data collection, database growth, storage growth, data processing, and recent changes.
 
                
Genome Sequencing Center Site Report - HEPiX Fall 2007 Gary Stiehr garystiehr@wustl.edu
General Data Flow: instrument-specific raw data (various formats) -> attached computer system or cluster -> our disk arrays -> various disk arrays and the cluster for analysis.
Sample Tracking: DNA sample preparation and movement is carefully tracked. 75+ Debian Linux systems with touch screens and barcode scanners take lab technician input. The OLTP schema runs on Oracle 10g RAC across four Infiniband-connected Sun X4100 servers, each with four cores and 16 GB of memory, using a 15 TB NetApp FAS980.
Data Collection: The bulk of our data is from DNA sequencing instruments scattered throughout the lab. In most cases, the data’s first stop is a locally attached, vendor-provided system or cluster. 235+ Windows systems, some Linux. Previously, data produced by the sequencers was stored mostly in Oracle databases. With newer sequencers, we store only tracking data in Oracle; raw data is on the file system.
Database Growth: DW currently 15.3 TB + 360 GB per month; OLTP currently 1.6 TB + 27 GB per month; OLAP currently 183 GB + 7.5 GB per month.
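As a rough illustration of what these rates imply, the sketch below projects each database's size forward. It assumes growth stays linear at the quoted monthly rate and that 1 TB = 1000 GB; neither assumption comes from the slides, and the names `DATABASES` and `projected_size_gb` are hypothetical.

```python
# Current size and monthly growth from the slide, both in GB.
# Assumption: 1 TB = 1000 GB (the slide does not say which convention is used).
DATABASES = {
    "DW": (15_300.0, 360.0),
    "OLTP": (1_600.0, 27.0),
    "OLAP": (183.0, 7.5),
}


def projected_size_gb(current_gb: float, growth_gb_per_month: float, months: int) -> float:
    """Linear projection: assumes the monthly growth rate stays constant."""
    return current_gb + growth_gb_per_month * months


if __name__ == "__main__":
    for name, (current, rate) in DATABASES.items():
        size = projected_size_gb(current, rate, 12)
        print(f"{name}: {size:,.1f} GB after 12 months")
```

A linear model is only a first approximation; if incoming data keeps accelerating, the real figures would be higher.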
Database Servers: Over the last year, we have been migrating database instances to Sun X4100 servers (16 GB RAM, two dual-core Opteron 285 processors). Oracle 10g RAC is used for DW (2 nodes) and OLTP (4 nodes), connected via Infiniband. We run Red Hat to stay within the Oracle and Cisco support matrices.
Incoming Data Growth: 5000% increase since last year, not counting user analysis (i.e., only production analysis). (Chart not preserved.)
Storage Growth: Storage space available for production data storage and archiving. (Chart not preserved.)
Current Storage: Currently utilizing NetApp and BlueArc as NAS. Older SAN infrastructure utilizes EMC, Hitachi and StorageTek. Two 700-slot StorageTek L700 tape libraries, with SDLT drives (on their way out) and T10K drives (may test T10K-B drives). NetBackup is used for backups.
Data Processing: Some sequencers ship with software to run on in-house clusters. It needs customizing to fit the local environment: Makefile-based parallelism, many small independent jobs, and finding the right granularity. Other sequencers ship with a four- or five-node cluster and tens of TB of disk. Power/cooling issues for multiple instruments?
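The granularity question can be sketched as a batching problem: many small independent jobs are grouped into batches so that per-job scheduling overhead does not dominate. The snippet below is a minimal illustration, not the center's actual pipeline; `batch_jobs` and the job names are hypothetical, and the right batch size would have to be found by measurement.

```python
from typing import Iterator, List, Sequence


def batch_jobs(jobs: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size jobs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(jobs), batch_size):
        yield list(jobs[start:start + batch_size])


if __name__ == "__main__":
    # Hypothetical job names standing in for small independent tasks.
    jobs = [f"job_{i:04d}" for i in range(10)]
    for batch in batch_jobs(jobs, batch_size=4):
        # Each batch would be submitted to the scheduler as a single unit.
        print(len(batch), batch[0], "...", batch[-1])
```

Smaller batches spread work more evenly across nodes, while larger batches reduce the number of submissions; the trade-off depends on how long each job runs.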
Processing Capacity: Platform LSF HPC manages compute nodes.
Large Memory Systems: 96 GB of memory, 4 Itaniums. Large datasets are utilizing more memory. New applications are asking for even more memory in some cases.
Large Memory Systems (continued): Looking at Sun X4600 servers with 8 server boards each, currently with one dual-core Opteron processor and 64 GB each. Configuration: 16 Opteron cores, 256 GB of memory, four 146 GB SAS drives. At full CPU utilization, Sun’s Power Calculator gives 1.14 kW (around 9.5 A @ 120 V).
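The core count and current draw quoted above follow from simple arithmetic, sketched here as a check. The function names are hypothetical, and the current calculation assumes I = P / V, which ignores power factor.

```python
def total_cores(boards: int, processors_per_board: int, cores_per_processor: int) -> int:
    """Total cores in a chassis built from identical server boards."""
    return boards * processors_per_board * cores_per_processor


def current_amps(power_watts: float, volts: float) -> float:
    """Current draw from power and voltage, assuming I = P / V."""
    return power_watts / volts


if __name__ == "__main__":
    # 8 boards, one dual-core Opteron each, as stated on the slide.
    print(total_cores(8, 1, 2), "cores")
    # 1.14 kW at 120 V, as reported by the vendor's power calculator.
    print(f"{current_amps(1140.0, 120.0):.1f} A")
```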
Recent Changes: Wider deployment of Ganglia for monitoring. Migration to LDAP-based authentication (from NIS). Out of physical space and cooling in the current data center (and up against weight limitations). Creating a disaster recovery environment at Washington University School of Medicine’s Business Continuity Center. Nightly offsite backups of Oracle databases.
Recent Changes (continued): Polycom 9101 video conferencing system being installed this week. For internal documentation and collaboration, migrated from PHPwiki to MediaWiki. Upgraded Linux systems from Debian Sarge to Debian Etch. Wireless network migrated to Cisco 4402 LAN controllers (from the previous set of individually managed APs).
Recent Changes (continued): Division of Statistical Genomics evaluating SASGrid; GSC preparing to install and manage DSG purchase of 400 cores and 50 TB of disk to support new studies. New data center construction started; to be completed May 2008 (more on this Wednesday).