Data Reservoir: Utilization of Multi-Gigabit Backbone Network for Data-Intensive Research. Mary Inaba, Makoto Nakamura, Kei Hiraki, University of Tokyo. AWOCA 2003.
Today’s Topic • New infrastructure for data-intensive scientific research • Problems of using the Internet
One day, I was surprised. One professor (Dept. of Astronomy) said: “The network is for e-mail and paper exchange. FedEx is for REAL data exchange.” (They use DLT tapes, and airplanes.)
Huge Data Producers: AKEBONO satellite, high-energy accelerator, SUBARU telescope, KAMIOKANDE (Nobel Prize), radio telescope in NOBEYAMA. A lot of data suggests a lot of scientific truth, obtained by computation. Now we can compute. ⇒ Data-Intensive Research
Huge Data Transfer (inquiry to professors) • Current state: data transfer by DLT, EVERY WEEK • Expected data size in a few years: 10 GB/day for satellite data; 50 GB/day for the high-energy accelerator; 50 PB tape archive for Earth Simulation • Observatories are shared by many researchers; hence the data NEED to be brought to each lab somehow. Does the network help?
Super-SINET backbone • Started January 2002 • Network for universities and institutes • Combination of a 10 Gbps ordinary line and several 1 Gbps project lines (physics, genome, Grid, etc.) [Figure: backbone map linking Hokkaido Univ., Tohoku Univ., KEK, Tsukuba Univ., Univ. of Tokyo, NAO, NII, Titech, ISAS, Waseda Univ., Nagoya Univ., Kyoto Univ., Doshisha Univ., Osaka Univ., Kyushu Univ., and Okazaki Labs through optical cross-connects]
Currently, it is not easy to transfer HUGE data over long distances while fully utilizing the bandwidth, because • TCP/IP is the protocol in common use, and for TCP/IP latency is the problem • Disk I/O speed (50 MB/sec) …
Recall HISTORY: Infrastructure for Scientific Research Projects • Utilization of the computing systems of the time, from the birth of the electronic computer • ① Numerical computation ⇒ Tables, Equations • ② ③ Supercomputing (vector) ⇒ Simulation • ④ Servers ⇒ Database, Data-mining, Genome • ⑤ Internet ⇒ Information Exchange, Documentation • Scientific researchers always utilize top-end systems [Figure: EDSAC ①, CDC-6600 ②, CRAY-1 ③, SUN Fire 15000 ④, 10G switch ⑤]
Frontier of Information Processing • New transition period: balance of computing systems • Very high-speed network • Large-scale disk storage • Cluster computers ⇒ New infrastructure for Data-Intensive Research [Figure: balance chart with axes CPU (GFLOPS), Network Interface (Gbps), Memory (GB), Remote Disks, Local Disks]
Basic Architecture [Figure: two Data Reservoirs, each with cache disks serving local file accesses, joined by a high-latency, very-high-bandwidth network; data moves between them by physically addressed, parallel, multi-stream transfer; together they present a distributed shared file (DSM-like architecture)]
Data-intensive scientific computation through SUPER-SINET [Figure: data sources (X-ray astronomy satellite ASUKA, Nobeyama Radio Observatory (VLBI), nuclear experiments, Belle experiments, Digital Sky Survey, SUBARU telescope, CERN) and Data Reservoirs sharing distributed files over a very high-speed network; data analysis at the University of Tokyo through local accesses]
Design Policy • Modification of the disk handler under the VFS layer • Direct access to the raw device for efficient data transfer • Multi-level striping for scalability • Use of the iSCSI protocol • Local file accesses through LAN • Global disk transfer through WAN • Single file image • File system transparency [Figure: driver stack with labels Application, File System, md (RAID) driver, Data Server, sd / sg / st SCSI driver, iSCSI driver, iSCSI daemon, sg, SCSI driver (mid), SCSI driver (low), Disks]
File accesses on Data Reservoir [Figure: scientific detectors and user programs access four file servers (1st-level striping); the file servers reach four disk servers through IP switches, with disk access by iSCSI (2nd-level striping)]
File accesses on Data Reservoir: User’s View [Figure: same configuration as the previous slide, shown from the user's point of view: scientific detectors and user programs, four file servers (1st-level striping), IP switches, disk access by iSCSI, four disk servers (2nd-level striping)]
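The two striping levels can be illustrated with a small sketch. This is not the Data Reservoir implementation: it assumes plain round-robin striping by block number, and it assumes each file server owns its own group of disk servers. The function name `locate_block` and both parameters are hypothetical.

```python
# Hypothetical sketch of two-level round-robin striping.
# Assumptions (not from the slides): striping is by fixed-size block,
# and each file server stripes over its own group of disk servers.

def locate_block(block, n_file_servers=4, disk_servers_per_file_server=4):
    """Map a logical block number to (file_server, disk_server, block_on_disk_server)."""
    # 1st-level striping: spread consecutive blocks over the file servers.
    file_server = block % n_file_servers
    block_on_file_server = block // n_file_servers
    # 2nd-level striping: each file server spreads its blocks over its disk servers.
    disk_server = block_on_file_server % disk_servers_per_file_server
    block_on_disk_server = block_on_file_server // disk_servers_per_file_server
    return file_server, disk_server, block_on_disk_server


for b in (0, 1, 4, 5, 16):
    print(b, locate_block(b))
```

With these assumptions, consecutive blocks land on different file servers first and on different disk servers next, so a large sequential transfer keeps every server busy.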
Global Data Transfer [Figure: scientific detectors, user programs, four file servers, two IP switches, and four disk servers; the disk servers exchange data over the global network by iSCSI bulk transfer]
Implementation (File Server) [Figure: software stack with labels Application, System Call, NFS, EXT2, Linux RAID, sd driver, sg driver, iSCSI driver, TCP/UDP, IP, Network]
Implementation (Disk Server) [Figure: software stack with labels Application Layer, System Call, iSCSI daemon, Data Stripe, dr driver, sg driver, iSCSI driver, SCSI driver, TCP, IP, Network, and three disks]
Performance evaluation of Data Reservoir • Local experiment: 1 Gbps model (basic performance) • 40 km experiments: 1 Gbps model, U. of Tokyo ⇔ ISAS • 1600 km experiments: 1 Gbps model • 26 ms latency (Tokyo ⇔ Kyoto ⇔ Osaka ⇔ Sendai ⇔ Tokyo) • High-quality network (SUPER-SINET Grid project lines) • US-Japan experiments • 1 Gbps model • U. of Tokyo ⇔ Fujitsu Lab. America (Maryland, USA) • U. of Tokyo ⇔ SCinet (Maryland, USA) • 10 Gbps experiments: compare four different switch configurations • Extreme Summit 7i, trunked 8 Gigabit Ethernets • RiverStone RS16000, trunked 8 and 12 1000BASE-SX • Foundry BigIron, 10GBASE-LR modules • Extreme BlackDiamond, trunked 8 1000BASE-SX • Foundry BigIron, trunked 2 10GBASE-LR • The bottleneck (8 Gbps): trunking 8 Gigabit Ethernets
Performance Comparison to ftp (40 km) • ftp: optimal performance (minimum disk head movements) • iSCSI: queued operation • iSCSI transfer is 55% faster than ftp on a single TCP stream
1600 km experiment system • 870 Mbps file transfer bandwidth • Path: Univ. of Tokyo (Cisco 6509) ↓ 1G Ether (Super-SINET) ↓ Kyoto Univ. (Extreme Black Diamond) ↓ 1G Ether (Super-SINET) ↓ Osaka Univ. (Cisco 3508) ↓ 1G Ether (Super-SINET) ↓ Tohoku Univ. (jumper fiber) ↓ 1G Ether (Super-SINET) ↓ Univ. of Tokyo (Extreme Summit 7i)
Network for 1600 km experiments • Grid project networks of SUPER-SINET • One-way latency 26 ms [Figure: IBM servers at Univ. of Tokyo, Kyoto Univ., Osaka Univ., and Tohoku Univ. (Sendai) connected by GbE lines; distance labels 550 mile, 250 mile, 300 mile, and 1000 mile line]
Transfer speed on 1600 km experiment • Maximum bandwidth by SmartBits = 970 Mbps • Overheads of headers ~ 5% [Figure: bar chart of transfer rate (Mbps, axis 0 to 1000) versus system configuration (file servers * disk servers * disks per disk server): 1*4*8, 1*4*(2+2), 1*4*4, 1*2*8, 1*2*(2+2), 1*2*4, 1*1*8, 1*1*(2+2), 1*1*4; bar labels from highest to lowest: 870, 828, 812, 737, 707, 700, 499, 493, 478 Mbps]
10 Gbps experiment • 11.7 Gbps transfer bandwidth • Local connection of two 10 Gbps models • 10GBASE-LR or 8 to 12 1000BASE-SX • 24 disk servers + 6 file servers • Dell 1650, 1.26 GHz Pentium III × 2, 1 GB memory, ServerSet III HE-SL • NetGear GE NIC • Extreme Summit 7i (trunking) • Extreme Black Diamond 6808 • Foundry BigIron (10GBASE-LR) • RiverStone RS-16000
Performance on 10 Gbps model • 300 GBytes file transfer (iSCSI streams) • 5% header loss due to TCP/IP, iSCSI • 7% performance loss due to trunking • Uneven use of disk servers • 100 GB file transfer in 2 minutes
US-Japan Experiments at SC2002 Bandwidth Challenge • 92% usage of bandwidth using TCP/IP
User’s View: TCP is a PIPE [Figure: input data “abcde” enters TCP on one side of the Internet as a byte stream; the output on the other side is the same data, in the same order]
TCP’s View [Figure: byte stream “abcde” sent between two TCP endpoints across the Internet] • Check whether all data has arrived • Re-order when the arrival order is wrong • Ask for a re-send when data is missing • Speed control
TCP’s feature • Keep data until the acknowledgement arrives • Speed control (congestion control) without knowing the state of routers • Use a buffer (window); when an ACK arrives from the receiver, new data is moved into the buffer • Make the buffer (window) small when congestion is suspected
Window Size and Throughput • Roughly speaking, Throughput = Window Size / RTT (RTT: Round Trip Time) • Hence, a longer RTT needs a larger window size for the same throughput.
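As a worked example of the relation Throughput = Window Size / RTT, the sketch below computes the window needed to fill a link and the throughput a fixed window allows. The 26 ms figure is taken from the 1600 km experiment and is treated here as the round-trip time, which is an assumption; the 65535-byte window is the largest TCP window without window scaling and is used only for illustration.

```python
# Sketch of Throughput = Window Size / RTT (function names are illustrative).

def required_window_bytes(throughput_bps, rtt_s):
    """Window (bytes) needed to sustain throughput_bps over a path with RTT rtt_s."""
    return throughput_bps * rtt_s / 8.0


def max_throughput_bps(window_bytes, rtt_s):
    """Upper bound on throughput (bits/s) for a given window and RTT."""
    return window_bytes * 8.0 / rtt_s


rtt = 0.026  # seconds; assumed RTT, borrowed from the 1600 km experiment latency
print("window for 1 Gbps:", required_window_bytes(1e9, rtt), "bytes")
print("throughput with a 65535-byte window:",
      max_throughput_bps(65535, rtt) / 1e6, "Mbps")
```

Under these assumptions a 1 Gbps stream needs a window of roughly 3.25 MB, while an unscaled 64 KB window caps a single stream at about 20 Mbps.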
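Congestion Control: AIMD (Additive Increase, Multiplicative Decrease) • Start phase: the window grows on every ACK, so it doubles every round trip • AIMD phase: accelerate gradually after congestion has occurred; slow down rapidly when congestion is suspected [Figure: window size vs. time, showing the start phase followed by the AIMD phase]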
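The two phases can be traced with a toy model. This is a simplified sketch, not a real TCP stack: the window is counted in segments, time advances one round trip per step, the window doubles per round trip in the start phase, and a loss halves it. The function name, the threshold `ssthresh`, and the loss schedule are assumptions chosen for illustration.

```python
# Toy model of the start phase followed by AIMD (window in segments, one step per RTT).

def simulate_window(n_rtts, loss_at, ssthresh=16.0):
    """Return the congestion window at the start of each round trip."""
    cwnd = 1.0
    trace = []
    for t in range(n_rtts):
        trace.append(cwnd)
        if t in loss_at:
            # Multiplicative decrease: halve the window when congestion is detected.
            ssthresh = max(cwnd / 2.0, 2.0)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Start phase: the window doubles every round trip.
            cwnd = min(cwnd * 2.0, ssthresh)
        else:
            # Additive increase: one more segment per round trip.
            cwnd += 1.0
    return trace


print(simulate_window(12, loss_at={8}))
```

The trace climbs quickly to the threshold, creeps upward by one segment per round trip, and drops to half after the loss at round trip 8.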
Another Problem • Denote a “network with long latency and wide bandwidth” as LFN (Long Fat Pipe Network) • An LFN needs a large window size, but since each increment is triggered by an ACK, the speed of increment is also SLOW. (An LFN suffers under AIMD.)
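A rough estimate shows how slow the increase becomes. The sketch assumes the textbook AIMD behavior of one extra segment per round trip after the window is halved, and a 1460-byte segment; both are assumptions, as are the function name and the RTT values, which are illustrative rather than measured.

```python
# Estimate of the time AIMD needs to regain full speed after one loss.
# Assumptions: window halves on loss, then grows by one segment per RTT.

def recovery_time_s(bandwidth_bps, rtt_s, mss_bytes=1460):
    """Seconds needed to grow from half the full window back to the full window."""
    full_window_segments = bandwidth_bps * rtt_s / (8.0 * mss_bytes)
    rtts_needed = full_window_segments / 2.0  # one segment regained per RTT
    return rtts_needed * rtt_s


for rtt in (0.026, 0.2):  # illustrative RTTs in seconds
    print(f"RTT {rtt * 1000:.0f} ms: about {recovery_time_s(1e9, rtt):.0f} s to recover at 1 Gbps")
```

Because both the window and the step time grow with RTT, the recovery time grows with the square of RTT, which is why a single loss hurts a long fat pipe far more than a short one.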
Network Environment • The bottleneck (about 600 Mbps) • Note that 600 Mbps < 1 Gbps
92% using TCP/IP is good, but we still have a PROBLEM: several streams only make progress after other streams finish.
Fastest and slowest stream in the worst case • The slowest stream is 3 times slower than the fastest • Even after other streams finished, throughput did not recover [Figure: sequence number vs. time]
Hand-made Tools • DR Gigabit Network Analyzer • Needs accurate time stamps with 100 ns accuracy • Dumps full packets • Comet Delay and Drop • Pseudo Long Fat Pipe Network (LFN) • On Gigabit Ethernet, a packet is sent every 12 μsec
Unstable Throughput • In our long-distance data transfer experiments, throughput ranged from 8 Mbps to 120 Mbps (when using a Gigabit Ethernet interface).
Packet Distribution [Figure: number of packets per msec vs. time (sec)]
Packet Distribution of Fast Ethernet [Figure: number of packets per msec vs. time (sec)]
Gigabit Ethernet interface vs. Fast Ethernet interface • Even at the same “20 Mbps”, the behavior of 20 Mbps over a Gigabit Ethernet interface and 20 Mbps over a Fast Ethernet interface is completely different. • Gigabit Ethernet is very bursty. • Routers might not like this.
2 problems • When packets are sent in bursts, the router sometimes cannot cope (an unlucky stream is slow, a lucky stream is fast), especially when the bottleneck is below Gigabit speed. • More than 80% of the time, the sender does not send anything.
Problem of implementation • At 1 Gbps, supposing a 1500-byte Ethernet packet, one packet should be sent every 12 μsec. • On the other hand, the UNIX kernel timer is 10 msec.
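The arithmetic behind the 12 μsec figure, and the mismatch with a 10 msec timer, can be checked with a short sketch. It uses only the numbers on this slide; the function name is illustrative, and framing overhead such as preamble and inter-packet gap is ignored, as the slide does.

```python
# Packet spacing at line rate versus kernel timer granularity.

def packet_interval_s(packet_bytes, link_bps):
    """Time to put one packet on the wire at the given link speed."""
    return packet_bytes * 8.0 / link_bps


interval = packet_interval_s(1500, 1e9)   # 1500-byte packet on 1 Gbps
timer_tick = 0.010                        # 10 msec kernel timer
packets_per_tick = timer_tick / interval

print(f"one packet every {interval * 1e6:.0f} microseconds")
print(f"about {packets_per_tick:.0f} packet slots per timer tick")
```

A timer that fires once per roughly 833 packet slots cannot space packets evenly, so any pacing done from that timer necessarily sends in bursts.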
IPG (Inter Packet Gap) • The transmitter is always on; when no packet is being sent, it is in the idle state • Each frame is followed by an IPG of at least 12 bytes (IEEE 802.3) • Tunable in the e1000 driver (8 bytes to 1023 bytes)
IPG tuning for short distance
                   IPG 8 bytes   IPG 1023 bytes
Fast Ethernet      94.1 Mbps     56.7 Mbps
Gigabit Ethernet   941 Mbps      567 Mbps
Supposing an Ethernet frame of 1500 bytes, 1508 : 2523 is approximately 567 : 941, so the measurements match the theory. (Gigabit Ethernet has already been perfectly tuned for short-distance data transfer.)
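The ratio argument can be reproduced numerically. The sketch follows the slide's simplification that each 1500-byte frame occupies the wire for the frame plus its IPG, and compares the predicted throughput ratio with the measured Gigabit Ethernet figures; the function name is illustrative, and preamble and header overheads are deliberately ignored.

```python
# Predicted versus measured effect of the inter-packet gap (IPG) on throughput.

def wire_bytes_per_frame(frame_bytes, ipg_bytes):
    """Bytes of wire time consumed per frame under the slide's simplified model."""
    return frame_bytes + ipg_bytes


frame = 1500
# Throughput is inversely proportional to wire bytes per frame, so the
# throughput ratio (IPG 1023 : IPG 8) is 1508 / 2523.
predicted_ratio = wire_bytes_per_frame(frame, 8) / wire_bytes_per_frame(frame, 1023)
measured_ratio = 567.0 / 941.0  # Gigabit Ethernet measurements from the table

print(f"predicted ratio: {predicted_ratio:.4f}")
print(f"measured ratio:  {measured_ratio:.4f}")
```

The two ratios differ by less than one percentage point, which supports the slide's conclusion that short-distance behavior follows the simple model.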