Realization and Utilization of high-BW TCP on real applications


Presentation Transcript


1. Realization and Utilization of high-BW TCP on real applications
Kei Hiraki, Data Reservoir / GRAPE-DR project, The University of Tokyo

2. Computing System for Real Scientists
• Fast CPU, huge memory and disks, good graphics
  • Cluster technology, DSM technology, graphics processors
  • Grid technology
• Very fast remote file accesses
  • Global file systems, data-parallel file systems, replication facilities
• Transparency to local computation
  • No complex middleware, and no or only small modifications to existing software
• Real scientists are not computer scientists
• Computer scientists are not a workforce for real scientists

3. Objectives of Data Reservoir / GRAPE-DR (1)
• Sharing scientific data between distant research institutes
  • Physics, astronomy, earth science, simulation data
• Very high-speed single-file transfer over a Long Fat pipe Network
  • > 10 Gbps, > 20,000 km, > 400 ms RTT
• High utilization of the available bandwidth
  • Transferred file data rate > 90% of the available bandwidth
  • Including header overheads and initial negotiation overheads (a rough sketch of the header-overhead ceiling follows)
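To put the 90% target in perspective, the sketch below estimates the best-case TCP payload fraction of the raw line rate from standard Ethernet/IP/TCP header sizes. This is a back-of-the-envelope calculation, not a figure from the slides, and it ignores retransmissions, TCP options, and startup negotiation.

```python
# Best-case TCP goodput as a fraction of the raw line rate, from header
# overheads alone (no retransmissions, no TCP options, no startup cost).
def goodput_fraction(mtu, line_rate_gbps=10.0):
    eth_overhead = 7 + 1 + 14 + 4 + 12   # preamble, SFD, Ethernet header, FCS, inter-frame gap
    ip_tcp_headers = 20 + 20             # IPv4 header + TCP header without options
    payload = mtu - ip_tcp_headers
    wire_bytes = mtu + eth_overhead
    fraction = payload / wire_bytes
    return fraction, fraction * line_rate_gbps

for mtu in (1500, 8192):                 # standard frame vs. 8K jumbo frame
    frac, gbps = goodput_fraction(mtu)
    print(f"MTU {mtu}: {frac:.1%} of line rate, {gbps:.2f} Gbps payload on 10 GbE")
```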

4. Objectives of Data Reservoir / GRAPE-DR (2)
• GRAPE-DR: very high-speed attached processor for a server
  • 2004 – 2008
  • Successor of the GRAPE-6 astronomical simulator
• 2 PFLOPS on a 128-node cluster system (the per-level figures multiply out as in the check below)
  • 1 GFLOPS / processor
  • 1024 processors / chip
  • 8 chips / PCI card
  • 2 PCI cards / server
  • 2 M processors / system
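A quick consistency check of these figures, using only the numbers quoted on this slide:

```python
# Multiply the per-level figures from slide 4 up to the full system.
gflops_per_pe = 1
pes_per_chip = 1024
chips_per_card = 8
cards_per_server = 2
servers = 128                                  # 128-node cluster

total_pes = pes_per_chip * chips_per_card * cards_per_server * servers
print(total_pes)                               # 2097152 PEs, i.e. ~2 M processors
print(total_pes * gflops_per_pe / 1e6)         # ~2.1 PFLOPS peak
```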

5. Data-intensive scientific computation through global networks
[Diagram: data sources (X-ray astronomy satellite ASUKA, nuclear experiments, Nobeyama Radio Observatory (VLBI), Belle experiments, Digital Sky Survey, SUBARU Telescope, GRAPE-6) feed Data Reservoir nodes; distributed shared files travel over a very high-speed network and are accessed locally for data analysis at the University of Tokyo.]

6. Basic Architecture
[Diagram: two Data Reservoir nodes with cache disks serve local file accesses at each site; distributed shared data (a DSM-like architecture) is kept consistent by disk-block-level parallel, multi-stream transfer over a high-latency, very high-bandwidth network.]

7. File accesses on Data Reservoir
[Diagram: scientific detectors and user programs access file servers (1st-level striping); the file servers reach disk servers through IP switches using iSCSI (2nd-level striping); the disk servers are IBM x345 machines (2 x 2.6 GHz).]

8. Global Data Transfer
[Diagram: the same file-server / disk-server hierarchy at both sites; the disk servers perform iSCSI bulk transfer across the global network through the IP switches.]

9. Problems found in the 1st-generation Data Reservoir
• Low TCP bandwidth due to packet losses
  • TCP congestion window size control
  • Very slow recovery from the fast recovery phase (> 20 min)
• Unbalance among parallel iSCSI streams
  • Packet scheduling by switches and routers
  • Users and other network users care only about the total behavior of the parallel TCP streams

10. Fast Ethernet vs. GbE
• iperf runs of 30 seconds
• Min/Avg: Fast Ethernet > GbE
[Charts: iperf bandwidth over time for Fast Ethernet (FE) and GbE.]

11. Packet Transmission Rate
• Bursty behavior
  • Transmission happens within 20 ms against an RTT of 200 ms
  • Idle for the remaining 180 ms
[Chart: packet transmission over time; an annotation marks where packet loss occurred.]

12. Packet Spacing
• Ideal story: transmit a packet every RTT/cwnd
  • 24 μs interval for 500 Mbps (MTU 1500 B); see the sketch below
• High load for a software-only implementation
• Low overhead in practice because it is used only during the slow-start phase
[Diagram: packets spaced RTT/cwnd apart within one RTT.]
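A quick check of the quoted interval, using the rate and MTU from the slide:

```python
# Ideal pacing gap: one MTU-sized packet every RTT/cwnd, which is the same
# as spreading the target rate evenly in time.
mtu_bits = 1500 * 8            # 1500-byte packets
target_rate = 500e6            # 500 Mbps
print(mtu_bits / target_rate)  # 2.4e-05 s, i.e. the 24 us interval on the slide

# Expressed as RTT/cwnd: the window that sustains 500 Mbps at RTT = 200 ms.
rtt = 0.2
cwnd = rtt * target_rate / mtu_bits   # ~8333 packets in flight
print(rtt / cwnd)                     # again 2.4e-05 s
```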

13. Example Case of 8 IPG
• Successful Fast Retransmit
• Smooth transition to Congestion Avoidance
• Congestion Avoidance takes 28 minutes to recover to 550 Mbps (see the rough estimate below)
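The slow recovery follows from standard congestion avoidance growing the window by about one segment per RTT. The estimate below is a sketch under that assumption; the exact time depends on the window the flow recovers from, so it only shows the order of magnitude:

```python
# Time for AIMD congestion avoidance to regrow the window after one loss
# event on a long fat network (one extra segment per RTT).
rate = 550e6             # rate to recover to, bits/s
rtt = 0.2                # ~200 ms round-trip time
seg_bits = 1500 * 8      # standard MTU segments

cwnd_full = rate * rtt / seg_bits       # ~9167 segments sustain 550 Mbps
rtts_needed = cwnd_full / 2             # halved window regrows by 1 segment/RTT
print(rtts_needed * rtt / 60)           # ~15 minutes just to regrow the window
```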

14. Best Case of 1023 B IPG
• Behaves like the Fast Ethernet case
• Proper transmission rate
• Spurious retransmits due to packet reordering

15. Unbalance within parallel TCP streams
• Unbalance among parallel iSCSI streams
  • Packet scheduling by switches and routers
  • Meaningless unfairness among the parallel streams
  • Users and other network users care only about the total behavior of the parallel TCP streams
• Our approach (illustrated by the sketch below)
  • Keep Σ cwnd_i constant, for fair TCP network usage towards other users
  • Balance the individual cwnd_i by communicating between the parallel TCP streams
[Charts: per-stream bandwidth over time, unbalanced vs. balanced.]
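A minimal sketch of the balancing idea, assuming the per-stream congestion windows can be read and adjusted. This only illustrates the invariant; it is not the mechanism actually implemented in Data Reservoir:

```python
# Move window from fat streams to thin ones while keeping the sum of
# cwnd_i unchanged, so the aggregate stays as TCP-friendly as before.
def rebalance(cwnds, step=0.1):
    target = sum(cwnds) / len(cwnds)                  # equal share per stream
    return [c + step * (target - c) for c in cwnds]   # the sum is preserved

windows = [120.0, 40.0, 200.0, 40.0]      # hypothetical per-stream cwnds (segments)
for _ in range(30):
    windows = rebalance(windows)
print([round(w, 1) for w in windows])     # converges toward 100 each; sum stays 400
```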

16. 3rd Generation Data Reservoir
• Hardware and software basis for 100 Gbps distributed data-sharing systems
• 10 Gbps disk data transfer by a single Data Reservoir server
• Transparent support for multiple file systems (detection of modified disk blocks; a sketch of one possible approach follows)
• Hardware (FPGA) implementation of inter-layer coordination mechanisms
• 10 Gbps Long Fat pipe Network emulator and 10 Gbps data logger
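One file-system-agnostic way to detect modified disk blocks is to compare per-block digests between synchronizations. This is only an illustration of the idea; the slide does not describe the mechanism the project implemented:

```python
# Illustration: detect changed blocks on a block device by comparing
# per-block digests against the previous snapshot.
import hashlib

BLOCK_SIZE = 4096

def block_digests(device_path):
    digests = []
    with open(device_path, "rb") as dev:
        while block := dev.read(BLOCK_SIZE):
            digests.append(hashlib.sha1(block).digest())
    return digests

def modified_blocks(old, new):
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
```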

17. Utilization of a 10 Gbps network
• A single-box 10 Gbps Data Reservoir server
  • Quad Opteron server with multiple PCI-X buses (prototype: SUN V40z server)
  • Two Chelsio T110 TCP-offloading NICs
  • Disk arrays for the necessary disk bandwidth
  • Data Reservoir software (iSCSI daemon, disk driver, data transfer manager)
[Diagram: the quad Opteron server (SUN V40z, Linux 2.6.6) with two Chelsio T110 TCP NICs on PCI-X buses connected via 10GBASE-SR to a 10 G Ethernet switch, plus Ultra320 SCSI adaptors on further PCI-X buses; the Data Reservoir software runs on top.]

18. Tokyo-CERN experiment (Oct. 2004)
• CERN - Amsterdam - Chicago - Seattle - Tokyo
  • SURFnet - CA*net 4 - IEEAF/Tyco - WIDE
  • 18,500 km WAN PHY connection
• Performance results (see the buffer-sizing sketch below)
  • 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
  • 7.53 Gbps (TCP payload), 8K jumbo frames, iperf
  • 8.8 Gbps disk-to-disk performance
    • 9 servers, 36 disks
    • 36 parallel TCP streams
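On a path this long the bandwidth-delay product dictates very large TCP windows and socket buffers. The calculation below uses the "> 400 ms RTT" class quoted earlier, not a measured RTT from this run:

```python
# Bandwidth-delay product: the amount of data that must be in flight to
# fill a 10 Gbps path. Socket buffers (and the Linux limits such as
# net.core.rmem_max / net.core.wmem_max) have to be at least this large.
rate_bps = 10e9
for rtt in (0.4, 0.5):                    # assumed RTTs on a >20,000 km path
    bdp_bytes = rate_bps * rtt / 8
    print(f"RTT {rtt*1e3:.0f} ms: BDP = {bdp_bytes / 2**20:.0f} MiB")
# RTT 400 ms: BDP = 477 MiB
# RTT 500 ms: BDP = 596 MiB
```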

19. Tokyo-CERN network connection
[Map of the network used in the experiment: Geneva (CERN) - SURFnet - Amsterdam - Chicago - Minneapolis - Calgary / Vancouver (CANARIE, CA*net 4) - Seattle (IEEAF) - Tokyo; the legend marks end systems and L1 or L2 switches.]

20. Network topology of the CERN-Tokyo experiment
[Diagram: Data Reservoir at the University of Tokyo (dual Opteron 248 2.2 GHz servers, 1 GB memory, Linux 2.6.6 No. 2-6, Chelsio T110 NICs; Fujitsu XG800 12-port switches; IBM x345 servers with dual Xeon 2.4 GHz, 2 GB memory, Linux 2.6.6 No. 2-7 / Linux 2.4.x No. 1) connected through T-LEX Tokyo, WIDE / IEEAF, CA*net 4 (Seattle, Vancouver, Calgary, Minneapolis), StarLight Chicago (Foundry NetIron 40G) and SURFnet / NetherLight Amsterdam to the Data Reservoir at CERN (Geneva) (Foundry BI MG8, Foundry FEXx448, Extreme Summit 400, IBM x345 servers); 10GBASE-LW WAN links in the core, GbE at the edges; Pacific Northwest Gigapop at Seattle.]

21. LSR experiments
• Target
  • > 30,000 km LSR (Land Speed Record) distance
  • L3 switching at Chicago and Amsterdam
• Period of the experiment
  • 12/20 - 1/3
  • Holiday season, when the public research networks are lightly used
• System configuration
  • A pair of Opteron servers with Chelsio T110 NICs (at N-Otemachi)
  • Another pair of Opteron servers with Chelsio T110 NICs for competing-traffic generation
  • ClearSight 10 Gbps packet analyzer for packet capturing

22. Network used in the experiment (Figure 2)
[Map: Tokyo - APAN/JGN2 and IEEAF/Tyco/WIDE - Seattle - Vancouver / Calgary / Minneapolis (CANARIE, CA*net 4) - Chicago - Abilene - NYC - SURFnet - Amsterdam; the legend marks routers or L3 switches and L1 or L2 switches.]

23. Single-stream TCP: Tokyo - Chicago - Amsterdam - NY - Chicago - Tokyo
[Diagram of the round-trip path and the equipment along it: Opteron servers with Chelsio T110 NICs, a Fujitsu XG800 switch and a ClearSight 10 Gbps capture unit at the University of Tokyo (T-LEX); WIDE and IEEAF/Tyco WAN PHY links across the Pacific to Seattle (Pacific Northwest Gigapop); CANARIE / CA*net 4 (Vancouver, Calgary, Minneapolis; OME 6550, ONS 15454, HDXc); StarLight Chicago (Foundry NetIron 40G, Force10 E1200, T640, CISCO 12416); SURFnet OC-192 across the Atlantic to NetherLight Amsterdam (University of Amsterdam, Force10 E600); New York MANLAN and Abilene (T640, CISCO 12416); TransPAC / APAN/JGN (Procket 8812, Procket 8801, CISCO 6509) back to Tokyo.]

24. Network traffic on routers and switches
[Charts: traffic during the submitted run on the StarLight Force10 E1200, the University of Amsterdam Force10 E600, the Abilene T640 (NYCM to CHIN), and the TransPAC Procket 8801.]

25. Summary
• Single-stream TCP
  • We removed the TCP-related difficulties
  • The I/O bus bandwidth is now the bottleneck
  • Cheap and simple servers can enjoy a 10 Gbps network
• Lack of methodology in high-performance network debugging
  • 3 days of debugging (working overnight)
  • 1 day of stable operation (usable for measurements)
  • The network may "feel fatigue"; some trouble is bound to happen
  • We need something more effective
• Detailed issues
  • Flow control (and QoS)
  • Buffer size and policy
  • Optical-level settings

26. Systems used in long-distance TCP experiments
[Slide shows the systems at CERN, Pittsburgh, and Tokyo.]

27. Efficient and effective utilization of high-speed internet
• Efficient and effective utilization of a 10 Gbps network is still very difficult
• PHY, MAC, data link, and switches
  • 10 Gbps is ready to use
• Network interface adaptor
  • 8 Gbps is ready to use; 10 Gbps in several months
  • Proper offloading and RDMA implementation needed
• I/O bus of the server
  • 20 Gbps is necessary to drive a 10 Gbps network
• Drivers and operating system
  • Too many interrupts, buffer memory management
• File system
  • Slow NFS service
  • Consistency problems

28. Difficulty in the 10 Gbps Data Reservoir
• Disk-to-disk single-stream TCP data transfer
• High CPU utilization (performance limited by the CPU)
  • Too many context switches
  • Too many interrupts from the network adaptor (> 30,000/s)
  • Data copies from buffer to buffer
• I/O bus bottleneck
  • PCI-X/133: maximum 7.6 Gbps data transfer (see the check below)
  • Waiting for PCI-X/266 or PCI Express x8 / x16 NICs
• Disk performance
  • Performance limit of the RAID adaptor
  • Number of disks needed for the transfer rate (> 40 disks are required)
• File system
  • High bandwidth in file service is harder than plain data sharing
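The 7.6 Gbps ceiling is consistent with the bus width and clock of PCI-X/133 minus protocol overhead. A rough check of the arithmetic; the overhead fraction used here is an assumption, not a figure from the slides:

```python
# PCI-X/133: a 64-bit bus clocked at 133 MHz. Address phases, arbitration
# and other protocol overhead take roughly 10% of the raw rate in practice
# (that fraction is an assumption).
bus_width_bits = 64
clock_hz = 133e6
raw_gbps = bus_width_bits * clock_hz / 1e9
print(raw_gbps)            # ~8.5 Gbps raw
print(raw_gbps * 0.9)      # ~7.7 Gbps usable, close to the 7.6 Gbps quoted
```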

29. High-speed IP networks in supercomputing (GRAPE-DR project)
• World's fastest computing system
  • 2 PFLOPS in 2008 (performance on actual application programs)
• Construction of a general-purpose massively parallel architecture
  • Low power consumption at PFLOPS-range performance
  • An MPP architecture more general-purpose than vector architecture
• Use of commodity networks for the interconnect
  • 10 Gbps optical network (2008) + MEMS switches
  • 100 Gbps optical network (2010)

30. [Chart: peak performance (FLOPS, from 1 M through 1 G, 1 T, 1 P, 1 E, 1 Z, 1 Y up to 10^27 and 10^30) versus year (1970-2050); marked systems include the Earth Simulator (40 TFLOPS), the KEISOKU supercomputer (10 PFLOPS), and the GRAPE-DR target performance of 2 PFLOPS; parallel processors / processor chips (16, 64, 256, 1K) are also indicated.]

31. GRAPE-DR architecture
• Massively parallel processor
  • Pipelined connection of a large number of PEs
  • SIMASD (Single Instruction on Multiple And Shared Data)
    • All instructions operate on data in local memory and in shared memory
    • An extension of vector architecture
• Issues
  • Compiler for the SIMASD architecture (currently under development: flat-C)
[Diagram: each PE has local memory, an integer ALU and a floating-point ALU; 512 PEs are connected by an on-chip network to on-chip shared memory and a control processor (CP), which links to external shared memory and the outside world.]

32. Hierarchical construction of GRAPE-DR
• 512 PE / chip, 512 GFLOPS / chip
• 2K PE / PCI board, 2 TFLOPS / PCI board
• 8K PE / server, 8 TFLOPS / server
• 1M PE / node, 1 PFLOPS / node
• 2M PE / system, 2 PFLOPS / system
[Diagram of the chip / board / server / node / system hierarchy with attached memory.]

33. Network architecture inside a GRAPE-DR system
[Diagram: AMD-based servers with 100 Gbps optical interfaces attached to the memory bus and iSCSI servers for IP storage, connected through a MEMS-based optical switch and a highly functional router to the outside IP network; an adaptive compiler and a total-system conductor perform dynamic optimization.]

  34. Fujitsu Computer Technologies, LTD
